Towards Faster Rates and Oracle Property for
Low-Rank Matrix Estimation 

Huan Gui* Quanquan Gu^ 


Abstract 

We present a unified framework for low-rank matrix estimation with nonconvex penalties. We first 
prove that the proposed estimator attains a faster statistical rate than the traditional low-rank matrix 
estimator with nuclear norm penalty. Moreover, we rigorously show that under a certain condition on the 
magnitude of the nonzero singular values, the proposed estimator enjoys the oracle property (i.e., exactly
recovers the true rank of the matrix), besides attaining a faster rate. As far as we know, this is the first
work that establishes the theory of low-rank matrix estimation with nonconvex penalties, confirming the
advantages of nonconvex penalties for matrix completion. Numerical experiments on both synthetic and
real-world datasets corroborate our theory.


1 Introduction 

Statistical estimation of low-rank matrices (Srebro et al., 2004; Candes and Tao, 2010; Rohde et al., 2011;
Koltchinskii et al., 2011a; Candes and Recht, 2012; Jain et al., 2013) has received increasing interest in the
past decade. It has broad applications in many fields such as data mining and computer vision. For example,
in recommendation systems, one aims to predict the unknown preferences of a set of users on a set of
items, given a partially observed rating matrix. Another application of low-rank matrix estimation is
image inpainting, where only a portion of the pixels is observed and the missing pixels are to be recovered based
on the observed ones.

Since it is not tractable to minimize the rank of a matrix directly, many surrogate loss functions of the
matrix rank have been proposed (e.g., nuclear norm (Srebro et al., 2004; Candes and Tao, 2010; Recht
et al., 2010; Negahban and Wainwright, 2011; Koltchinskii et al., 2011a), Schatten-p norm (Rohde et al.,
2011; Nie et al., 2012), max norm (Srebro and Shraibman, 2005; Cai and Zhou, 2013), the von Neumann
entropy (Koltchinskii et al., 2011b)). Among those surrogate losses, the nuclear norm is probably the most widely
used penalty for low-rank matrix estimation (Negahban and Wainwright, 2011; Koltchinskii et al., 2011a),
since it is the tightest convex relaxation of the matrix rank.

On the other hand, it is now well known that the $\ell_1$ penalty in Lasso (Fan and Li, 2001; Zhang, 2010; Zou,
2006) introduces a bias into the resulting estimator, which compromises the estimation accuracy. In contrast,
nonconvex penalties such as the smoothly clipped absolute deviation (SCAD) penalty (Fan and Li, 2001) and the
minimax concave penalty (MCP) (Zhang, 2010) are favored in terms of estimation accuracy and variable
selection consistency (Wang et al., 2013b). Due to the close connection between the $\ell_1$ norm and the nuclear norm
(the nuclear norm can be seen as an $\ell_1$ norm defined on the singular values of a matrix), nonconvex penalties
for low-rank matrix estimation have recently received increasing attention.
Typical examples of nonconvex approximations of the matrix rank include the Schatten $\ell_p$-norm ($0 < p < 1$) (Nie
et al., 2012), the truncated nuclear norm (Hu et al., 2013), and the MCP penalty defined on the singular
values of a matrix (Wang et al., 2013a; Liu et al., 2013). Although good empirical results have been observed
in these studies (Nie et al., 2012; Hu et al., 2013; Wang et al., 2013a; Liu et al., 2013; Lu et al., 2014),
little is known about the theory of nonconvex penalties for low-rank matrix estimation. In other words, the
theoretical justification for the nonconvex surrogates of matrix rank is still an open problem.

In this paper, to bridge the gap between practice and theory of low-rank matrix estimation, we propose 
a unified framework for low-rank matrix estimation with nonconvex penalties, followed by its theoretical 
analysis. To the best of our knowledge, this is the first work that establishes the theoretical properties 
of low-rank matrix estimation with nonconvex penalties. Our first result demonstrates that the proposed
estimator, by taking advantage of singular values with large magnitude, attains faster statistical convergence
rates than the conventional estimator with nuclear norm penalty. Furthermore, under a mild assumption on the
magnitude of the singular values, we rigorously show that the proposed estimator enjoys the oracle property:
it exactly recovers the true rank of the underlying matrix and attains a faster rate. Our theoretical
results are verified through both simulations and experiments on real-world datasets for collaborative
filtering and image inpainting.

The remainder of this paper is organized as follows. Section 2 is devoted to the set-up of the problem and
the proposed estimator. We present the theoretical analysis, with illustrations on specific examples, in Section 3.
The numerical experiments are reported in Section 4, and Section 5 concludes the paper.

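Notation. We use lowercase letters ($a, b, \ldots$) to denote scalars, bold lowercase letters ($\mathbf{a}, \mathbf{b}, \ldots$) for vectors,
and bold uppercase letters ($\mathbf{A}, \mathbf{B}, \ldots$) for matrices. For a real number $a$, we denote by $\lfloor a \rfloor$ the largest
integer that is no greater than $a$. For a vector $\mathbf{x}$, define the vector norm as $\|\mathbf{x}\|_2 = \sqrt{\sum_i x_i^2}$. We also define
$\mathrm{supp}(\mathbf{x})$ as the support of $\mathbf{x}$. Considering a matrix $\mathbf{A}$, we denote by $\lambda_{\max}(\mathbf{A})$ and $\lambda_{\min}(\mathbf{A})$ the largest and
smallest eigenvalues of $\mathbf{A}$, respectively. For a pair of matrices $\mathbf{A}, \mathbf{B}$ with commensurate dimensions, $\langle \mathbf{A}, \mathbf{B} \rangle$
denotes the trace inner product on matrix space, $\langle \mathbf{A}, \mathbf{B} \rangle := \mathrm{trace}(\mathbf{A}^\top \mathbf{B})$. Given a matrix $\mathbf{A} \in \mathbb{R}^{m_1 \times m_2}$,
its (ordered) singular values are denoted by $\gamma_1(\mathbf{A}) \ge \gamma_2(\mathbf{A}) \ge \cdots \ge \gamma_m(\mathbf{A}) \ge 0$, where $m = \min\{m_1, m_2\}$.
Moreover, $M = \max\{m_1, m_2\}$. We also define $\|\cdot\|$ for various norms on matrices based on the singular
values, including the nuclear norm $\|\mathbf{A}\|_* = \sum_{i=1}^m \gamma_i(\mathbf{A})$, the spectral norm $\|\mathbf{A}\|_2 = \gamma_1(\mathbf{A})$, and the Frobenius norm
$\|\mathbf{A}\|_F = \sqrt{\langle \mathbf{A}, \mathbf{A} \rangle} = \sqrt{\sum_{i=1}^m \gamma_i^2(\mathbf{A})}$. In addition, we define $\|\mathbf{A}\|_\infty = \max_{1 \le j \le m_1,\, 1 \le k \le m_2} |A_{jk}|$, where $A_{jk}$ is
the element of $\mathbf{A}$ at row $j$, column $k$.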

2 Low-rank Matrix Estimation with Nonconvex Penalties 

In this section, we present a unified framework for low-rank matrix estimation with nonconvex penalties,
followed by the theoretical analysis of the proposed estimator.

2.1 The Observation Model 

We consider a generic observation model as follows: 

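$$y_i = \langle \mathbf{X}_i, \Theta^* \rangle + \epsilon_i \quad \text{for } i = 1, 2, \ldots, n, \tag{2.1}$$

where $\{\mathbf{X}_i\}_{i=1}^n$ is a sequence of observation matrices, and $\{\epsilon_i\}_{i=1}^n$ are i.i.d. zero-mean Gaussian observation
noise with variance $\sigma^2$. Moreover, the observation model can be rewritten more compactly as
$\mathbf{y} = \mathfrak{X}(\Theta^*) + \boldsymbol{\epsilon}$, where $\mathbf{y} = (y_1, \ldots, y_n)^\top$, $\boldsymbol{\epsilon} = (\epsilon_1, \ldots, \epsilon_n)^\top$, and $\mathfrak{X}(\cdot)$ is the linear operator
$\mathfrak{X}(\Theta^*) := (\langle \mathbf{X}_1, \Theta^* \rangle, \langle \mathbf{X}_2, \Theta^* \rangle, \ldots, \langle \mathbf{X}_n, \Theta^* \rangle)^\top$. In addition, we define the adjoint of the operator $\mathfrak{X}$,
$\mathfrak{X}^* : \mathbb{R}^n \to \mathbb{R}^{m_1 \times m_2}$, as $\mathfrak{X}^*(\boldsymbol{\epsilon}) = \sum_{i=1}^n \epsilon_i \mathbf{X}_i$. Note that the observation model in (2.1) has been
considered before by Koltchinskii et al. (2011a); Negahban and Wainwright (2011).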
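As a minimal sketch of the observation model (2.1), the operator $\mathfrak{X}$ and its adjoint can be written with NumPy. The function names and the array layout (observation matrices stacked in an array of shape `(n, m1, m2)`) are illustrative assumptions, not part of the original text.

```python
import numpy as np

def apply_operator(X, theta):
    """X(Theta): the vector of trace inner products <X_i, Theta>, i = 1..n.

    X has shape (n, m1, m2) and theta has shape (m1, m2).
    """
    return np.einsum("ijk,jk->i", X, theta)

def apply_adjoint(X, e):
    """X*(e) = sum_i e_i X_i, a matrix of shape (m1, m2)."""
    return np.einsum("i,ijk->jk", e, X)

def simulate(X, theta_star, sigma, rng):
    """Draw y = X(Theta*) + eps with i.i.d. N(0, sigma^2) noise."""
    eps = sigma * rng.standard_normal(X.shape[0])
    return apply_operator(X, theta_star) + eps, eps
```

The two functions satisfy the defining adjoint identity $\langle \mathfrak{X}(\Theta), \mathbf{e} \rangle = \langle \Theta, \mathfrak{X}^*(\mathbf{e}) \rangle$, which is a convenient sanity check for any implementation.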

2.2 Examples 

Low-rank matrix estimation has broad applications. We briefly review two examples: matrix completion and 
matrix sensing. For more examples, please refer to Koltchinskii et al. (2011a); Negahban and Wainwright
(2011).

Example 2.1 (Matrix Completion). In the setting of matrix completion with noise, one uniformly observes
partial entries of the unknown matrix $\Theta^*$ with noise. In detail, the observation matrix $\mathbf{X}_i \in \mathbb{R}^{m_1 \times m_2}$ is of
the form $\mathbf{X}_i = \mathbf{e}_{j_i}(m_1)\, \mathbf{e}_{k_i}(m_2)^\top$, where $\mathbf{e}_{j_i}(m_1)$ and $\mathbf{e}_{k_i}(m_2)$ are the canonical basis vectors in $\mathbb{R}^{m_1}$ and $\mathbb{R}^{m_2}$,
respectively.

Example 2.2 (Matrix Sensing). In the setting of matrix sensing, one observes a set of random projections
of the unknown matrix $\Theta^*$. More specifically, the observation matrix $\mathbf{X}_i \in \mathbb{R}^{m_1 \times m_2}$ has i.i.d. standard normal
$N(0, 1)$ entries, so that one makes observations of the form $y_i = \langle \mathbf{X}_i, \Theta^* \rangle + \epsilon_i$. It is obvious that matrix
sensing is an instance of the model (2.1).
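The two designs in Examples 2.1 and 2.2 can be generated as follows. This is only a sketch: the function names are hypothetical, and sampling entries with replacement in the completion design is an assumption made here for simplicity.

```python
import numpy as np

def completion_design(n, m1, m2, rng):
    """Sample n indicator matrices X_i = e_j e_k^T with (j, k) drawn uniformly.

    Assumption: positions are drawn with replacement.
    """
    rows = rng.integers(0, m1, size=n)
    cols = rng.integers(0, m2, size=n)
    X = np.zeros((n, m1, m2))
    X[np.arange(n), rows, cols] = 1.0
    return X, rows, cols

def sensing_design(n, m1, m2, rng):
    """Sample n observation matrices with i.i.d. N(0, 1) entries."""
    return rng.standard_normal((n, m1, m2))
```

With the completion design, the inner product $\langle \mathbf{X}_i, \Theta \rangle$ simply reads off the entry of $\Theta$ at the sampled position, which is why matrix completion fits the generic model (2.1).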


2.3 The Proposed Estimator 

We now propose an estimator that is naturally designed for estimating low-rank matrices. Given a collection
of $n$ samples $Z_1^n = \{(y_i, \mathbf{X}_i)\}_{i=1}^n$, which is assumed to be generated from the observation model (2.1), the
unknown low-rank matrix $\Theta^* \in \mathbb{R}^{m_1 \times m_2}$ is estimated by solving the following optimization problem

$$\widehat{\Theta} = \operatorname*{argmin}_{\Theta \in \mathbb{R}^{m_1 \times m_2}} \frac{1}{2n} \|\mathbf{y} - \mathfrak{X}(\Theta)\|_2^2 + \mathcal{P}_\lambda(\Theta), \tag{2.2}$$

which includes two components: (i) the empirical loss function $\mathcal{L}_n(\Theta) = (2n)^{-1}\|\mathbf{y} - \mathfrak{X}(\Theta)\|_2^2$; and (ii) the
nonconvex penalty (Fan and Li, 2001; Zhang, 2010; Zhang et al., 2012) $\mathcal{P}_\lambda(\Theta)$ with regularization parameter
$\lambda$, which helps to enforce the low-rank structure constraint on the regularized M-estimator $\widehat{\Theta}$. Considering
the low-rank assumption on the matrices, we apply the nonconvex regularization on the singular values of $\Theta$,
which induces sparsity of the singular values, and therefore low-rankness of the matrix. For the singular values of $\Theta$,
$\gamma(\Theta) = (\gamma_1(\Theta), \gamma_2(\Theta), \ldots, \gamma_m(\Theta))$, where $\gamma_1(\Theta) \ge \cdots \ge \gamma_m(\Theta) \ge 0$, we define $\mathcal{P}_\lambda(\Theta) = \sum_{i=1}^m p_\lambda(\gamma_i(\Theta))$,
where $p_\lambda$ is a univariate nonconvex function. There is a line of research on nonconvex regularization, and
various nonconvex penalties have been proposed, such as SCAD (Fan and Li, 2001) and MCP (Zhang, 2010).
We take the SCAD penalty as an illustration, for which the function $p_\lambda(\cdot)$ is defined as follows:


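$$p_\lambda(t) = \begin{cases} \lambda |t|, & \text{if } |t| \le \lambda, \\ -\dfrac{t^2 - 2b\lambda|t| + \lambda^2}{2(b-1)}, & \text{if } \lambda < |t| \le b\lambda, \\ \dfrac{(b+1)\lambda^2}{2}, & \text{if } |t| > b\lambda, \end{cases}$$

where $b > 2$ and $\lambda > 0$. The SCAD penalty corresponds to a quadratic spline function with knots at $t = \lambda$
and $t = b\lambda$. In addition, the nonconvex penalty $p_\lambda(t)$ can be further decomposed as $p_\lambda(t) = \lambda|t| + q_\lambda(t)$,
where $\lambda|t|$ is the $\ell_1$ penalty and $q_\lambda(t)$ is a concave component. For the SCAD penalty, $q_\lambda(t)$ can be obtained
as follows.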


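$$q_\lambda(t) = -\frac{(|t| - \lambda)^2}{2(b-1)}\, \mathbb{1}(\lambda < |t| \le b\lambda) + \frac{(b+1)\lambda^2 - 2\lambda|t|}{2}\, \mathbb{1}(|t| > b\lambda).$$

Since the regularization term $\mathcal{P}_\lambda(\Theta)$ is imposed on the vector of singular values, the decomposability of $p_\lambda(t)$
is equivalent to the decomposability of $\mathcal{P}_\lambda(\Theta)$ as $\mathcal{P}_\lambda(\Theta) = \lambda\|\Theta\|_* + \mathcal{Q}_\lambda(\Theta)$, where $\mathcal{Q}_\lambda(\Theta) = \sum_{i=1}^m q_\lambda(\gamma_i(\Theta))$ is the concave
component and $\|\Theta\|_*$ is the nuclear norm.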
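The SCAD penalty, its concave part, and the induced matrix penalty can be sketched numerically as below. The function names are illustrative; the formulas follow the piecewise definitions of $p_\lambda$ and $q_\lambda$ given above.

```python
import numpy as np

def scad(t, lam, b):
    """SCAD penalty p_lam(t), evaluated elementwise (requires b > 2, lam > 0)."""
    t = np.abs(np.asarray(t, dtype=float))
    mid = -(t**2 - 2 * b * lam * t + lam**2) / (2 * (b - 1))
    flat = (b + 1) * lam**2 / 2
    return np.where(t <= lam, lam * t, np.where(t <= b * lam, mid, flat))

def scad_concave(t, lam, b):
    """Concave component q_lam(t) = p_lam(t) - lam * |t|."""
    t = np.abs(np.asarray(t, dtype=float))
    mid = -((t - lam) ** 2) / (2 * (b - 1))
    tail = ((b + 1) * lam**2 - 2 * lam * t) / 2
    return np.where(t <= lam, 0.0, np.where(t <= b * lam, mid, tail))

def matrix_penalty(theta, lam, b):
    """P_lam(Theta): the SCAD penalty summed over the singular values."""
    gamma = np.linalg.svd(theta, compute_uv=False)
    return float(scad(gamma, lam, b).sum())
```

Evaluating `scad(t, lam, b)` and `lam * np.abs(t) + scad_concave(t, lam, b)` on a grid gives matching values, which is the scalar decomposition $p_\lambda(t) = \lambda|t| + q_\lambda(t)$ in numerical form.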

Note that the estimator in (2.2) can be solved by proximal gradient algorithms (Gong et al., 2013; Ji and
Ye, 2009).
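The text does not spell out the algorithm, so the following is only a sketch of one proximal gradient step under stated assumptions: a unit step size, and the closed-form SCAD thresholding rule for the scalar problem $\min_t \tfrac{1}{2}(z - t)^2 + p_\lambda(t)$, applied to the singular values. The function names are hypothetical, and this is not claimed to be the procedure of the cited references.

```python
import numpy as np

def scad_threshold(z, lam, b):
    """Minimizer over t of 0.5 * (z - t)**2 + p_lam(t) for SCAD with b > 2."""
    z = np.asarray(z, dtype=float)
    a = np.abs(z)
    soft = np.sign(z) * np.maximum(a - lam, 0.0)          # |z| <= 2 * lam
    mid = ((b - 1) * z - np.sign(z) * b * lam) / (b - 2)   # 2 * lam < |z| <= b * lam
    return np.where(a <= 2 * lam, soft, np.where(a <= b * lam, mid, z))

def svd_scad_prox(Z, lam, b):
    """Apply the scalar SCAD thresholding rule to the singular values of Z."""
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    return (U * scad_threshold(s, lam, b)) @ Vt

def prox_grad_step(theta, grad, lam, b):
    """One proximal gradient step with unit step size (an assumption)."""
    return svd_scad_prox(theta - grad, lam, b)
```

Large singular values (above $b\lambda$) pass through unchanged, while small ones are soft-thresholded; this is the mechanism by which the nonconvex penalty avoids the shrinkage bias on large singular values.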

3 Main Theory 

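In this section, we present the main theoretical results for the proposed estimator in (2.2). We
first lay out the assumptions made on the empirical loss function and the nonconvex penalty.

Suppose the SVD of $\Theta^*$ is $\Theta^* = \mathbf{U}^* \Gamma^* \mathbf{V}^{*\top}$, where $\mathbf{U}^* \in \mathbb{R}^{m_1 \times r}$, $\mathbf{V}^* \in \mathbb{R}^{m_2 \times r}$, and $\Gamma^* = \mathrm{diag}(\gamma^*) \in \mathbb{R}^{r \times r}$.
We can construct the subspaces $\mathcal{F}$ and $\mathcal{F}^\perp$ as follows:

$$\mathcal{F}(\mathbf{U}^*, \mathbf{V}^*) := \{\Delta \mid \mathrm{row}(\Delta) \subseteq \mathbf{V}^* \text{ and } \mathrm{col}(\Delta) \subseteq \mathbf{U}^*\},$$

$$\mathcal{F}^\perp(\mathbf{U}^*, \mathbf{V}^*) := \{\Delta \mid \mathrm{row}(\Delta) \perp \mathbf{V}^* \text{ and } \mathrm{col}(\Delta) \perp \mathbf{U}^*\}.$$

The shorthand notations $\mathcal{F}$ and $\mathcal{F}^\perp$ are used whenever $\mathbf{U}^*, \mathbf{V}^*$ are clear from context. Remark that $\mathcal{F}$ is the
span of the row and column space of $\Theta^*$, and consequently $\Theta^* \in \mathcal{F}$. In addition, $\Pi_{\mathcal{F}}(\cdot)$ is the projection
operator that projects matrices onto the subspace $\mathcal{F}$.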
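As a numerical illustration of these subspaces, the sketch below uses the closed-form projections stated in Appendix A, $\Pi_{\mathcal{F}}(\mathbf{A}) = \mathbf{U}^*\mathbf{U}^{*\top}\mathbf{A}\mathbf{V}^*\mathbf{V}^{*\top}$ and $\Pi_{\mathcal{F}^\perp}(\mathbf{A}) = (\mathbf{I} - \mathbf{U}^*\mathbf{U}^{*\top})\mathbf{A}(\mathbf{I} - \mathbf{V}^*\mathbf{V}^{*\top})$. The function names are illustrative assumptions.

```python
import numpy as np

def subspace_projections(theta_star, r):
    """Return the projections onto F and F-perp built from a rank-r Theta*."""
    U, _, Vt = np.linalg.svd(theta_star, full_matrices=False)
    U, V = U[:, :r], Vt[:r].T
    PU, PV = U @ U.T, V @ V.T            # projectors onto col and row spaces
    m1, m2 = theta_star.shape

    def proj_F(A):
        return PU @ A @ PV

    def proj_F_perp(A):
        return (np.eye(m1) - PU) @ A @ (np.eye(m2) - PV)

    return proj_F, proj_F_perp

def nuclear_norm(A):
    """Sum of the singular values of A."""
    return float(np.linalg.svd(A, compute_uv=False).sum())
```

For a rank-$r$ matrix $\Theta^*$, `proj_F` leaves $\Theta^*$ unchanged and `proj_F_perp` maps it to zero. The nuclear norm is also additive across the two subspaces, $\|\Theta^* + \Pi_{\mathcal{F}^\perp}(\Delta)\|_* = \|\Theta^*\|_* + \|\Pi_{\mathcal{F}^\perp}(\Delta)\|_*$, which is the decomposability property used in the appendix.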

To begin with, we impose two conditions on the empirical loss function $\mathcal{L}_n(\cdot)$ over a restricted set, known
as restricted strong convexity (RSC) and restricted strong smoothness (RSS). These two assumptions state
that there are a quadratic lower bound and a quadratic upper bound, respectively, on the remainder of
the first-order Taylor expansion of $\mathcal{L}_n(\cdot)$. The RSC condition has been discussed extensively in previous
work (Negahban et al., 2012; Loh and Wainwright, 2013); it guarantees the strong convexity of the loss
function in the restricted set and helps to control the estimation error $\|\widehat{\Theta} - \Theta^*\|_F$. In particular, we define
the following subset, which is a cone of a restricted set of directions,

$$\mathcal{C} = \{\Delta \in \mathbb{R}^{m_1 \times m_2} \mid \|\Pi_{\mathcal{F}^\perp}(\Delta)\|_* \le 5\|\Pi_{\mathcal{F}}(\Delta)\|_*\}.$$

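Assumption 3.1 (Restricted Strong Convexity). For operator $\mathfrak{X}$, there exists some $\kappa(\mathfrak{X}) > 0$ such that, for
all $\Delta \in \mathcal{C}$,

$$\mathcal{L}_n(\Theta + \Delta) \ge \mathcal{L}_n(\Theta) + \langle \nabla\mathcal{L}_n(\Theta), \Delta \rangle + \frac{\kappa(\mathfrak{X})}{2}\|\Delta\|_F^2.$$

Assumption 3.2 (Restricted Strong Smoothness). For operator $\mathfrak{X}$, there exists some $\infty > \rho(\mathfrak{X}) \ge \kappa(\mathfrak{X})$ such
that, for all $\Delta \in \mathcal{C}$,

$$\mathcal{L}_n(\Theta) + \langle \nabla\mathcal{L}_n(\Theta), \Delta \rangle + \frac{\rho(\mathfrak{X})}{2}\|\Delta\|_F^2 \ge \mathcal{L}_n(\Theta + \Delta).$$

Recall that $\mathcal{L}_n(\Theta) = (2n)^{-1}\|\mathbf{y} - \mathfrak{X}(\Theta)\|_2^2$. It can be verified that, with high probability, $\mathcal{L}_n(\Theta)$ satisfies
both the RSC and RSS conditions for different applications, including matrix completion and matrix sensing. We
will establish the results for the RSC and RSS conditions in Section 3.2.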

Furthermore, we impose several regularity conditions on the nonconvex penalty $\mathcal{P}_\lambda(\cdot)$, in terms of the
univariate functions $p_\lambda(\cdot)$ and $q_\lambda(\cdot)$.

Assumption 3.3.

(i) On the nonnegative real line, the function $p_\lambda(t)$ satisfies $p'_\lambda(t) = 0$, $\forall\, t \ge \nu > 0$.

(ii) On the nonnegative real line, $q'_\lambda(t)$ is monotone and Lipschitz continuous, i.e., for $t' \ge t$, there exists a
constant $\zeta_- \ge 0$ such that $q'_\lambda(t') - q'_\lambda(t) \ge -\zeta_-(t' - t)$.

(iii) Both the function $q_\lambda(t)$ and its derivative $q'_\lambda(t)$ pass through the origin, i.e., $q_\lambda(0) = q'_\lambda(0) = 0$.

(iv) On the nonnegative real line, $|q'_\lambda(t)|$ is upper bounded by $\lambda$, i.e., $|q'_\lambda(t)| \le \lambda$.

Note that condition (ii) is a type of curvature property which determines the concavity level of $q_\lambda(\cdot)$, and
consequently the nonconvexity level of $p_\lambda(\cdot)$. These conditions are satisfied by many widely used nonconvex
penalties, such as SCAD and MCP. For instance, it is easy to verify that the SCAD penalty satisfies the conditions
in Assumption 3.3 with $\nu = b\lambda$ and $\zeta_- = 1/(b - 1)$.
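The claim for SCAD can be spot-checked numerically. The sketch below evaluates $q'_\lambda$ on a grid and tests conditions (i) to (iv) with $\nu = b\lambda$ and $\zeta_- = 1/(b-1)$; it is a numerical check on a finite grid, not a proof, and the function names are illustrative.

```python
import numpy as np

def scad_q_prime(t, lam, b):
    """Derivative of the concave part q_lam on t >= 0 for the SCAD penalty."""
    t = np.asarray(t, dtype=float)
    mid = -(t - lam) / (b - 1)
    return np.where(t <= lam, 0.0, np.where(t <= b * lam, mid, -lam))

def check_assumption(lam, b, num=2001):
    """Check conditions (i)-(iv) of Assumption 3.3 on a grid over [0, 2*b*lam]."""
    t = np.linspace(0.0, 2 * b * lam, num)
    qp = scad_q_prime(t, lam, b)
    p_prime = lam + qp                     # p'(t) = lam + q'(t) on t > 0
    nu, zeta = b * lam, 1.0 / (b - 1)
    slopes = np.diff(qp) / np.diff(t)      # finite-difference curvature of q
    return {
        "i_flat_beyond_nu": bool(np.allclose(p_prime[t >= nu], 0.0)),
        "ii_curvature": bool(np.all(slopes >= -zeta - 1e-9)),
        "iii_origin": bool(qp[0] == 0.0),
        "iv_bounded": bool(np.all(np.abs(qp) <= lam + 1e-12)),
    }
```

For example, `check_assumption(0.7, 3.5)` reports all four conditions as satisfied on the grid.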


3.1 Results for the Generic Observation Model 


We first present a deterministic error bound of the estimator for the generic observation model, as stated in 
Theorem 3.4. In particular, our results imply that matrix completion via a nonconvex penalty achieves a 
faster statistical convergence rate than the convex penalty, by taking advantage of large singular values. 


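Theorem 3.4 (Deterministic Bound for General Singular Values). Under Assumption 3.1, suppose that
$\widehat{\Delta} = \widehat{\Theta} - \Theta^* \in \mathcal{C}$ and the nonconvex penalty $\mathcal{P}_\lambda(\Theta) = \sum_{i=1}^m p_\lambda(\gamma_i(\Theta))$ satisfies Assumption 3.3. Under the
condition that $\kappa(\mathfrak{X}) > \zeta_-$, for any optimal solution $\widehat{\Theta}$ of (2.2) with regularization parameter $\lambda \ge 2n^{-1}\|\mathfrak{X}^*(\boldsymbol{\epsilon})\|_2$,
it holds that, for $r_1 = |S_1|$, $r_2 = |S_2|$,

$$\|\widehat{\Theta} - \Theta^*\|_F \le \underbrace{\frac{\tau\sqrt{r_1}}{\kappa(\mathfrak{X}) - \zeta_-}}_{S_1:\ \gamma_i^* \ge \nu} + \underbrace{\frac{3\lambda\sqrt{r_2}}{\kappa(\mathfrak{X}) - \zeta_-}}_{S_2:\ \nu > \gamma_i^* > 0}, \tag{3.1}$$

where $\tau = \|\Pi_{\mathcal{F}_{S_1}}(\nabla\mathcal{L}_n(\Theta^*))\|_2$, and $\mathcal{F}_{S_1}$ is a subspace of $\mathcal{F}$ associated with $S_1$.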

It is important to note that the upper bound on the Frobenius norm-based estimation error includes
two parts corresponding to different magnitudes of the singular values of the true matrix, i.e., $\gamma^*$: (i) $S_1$
corresponds to the set of singular values with larger magnitude; and (ii) $S_2$ corresponds to the set of singular
values with smaller magnitude. By setting $\zeta_- = \kappa(\mathfrak{X})/2$, we have

$$\|\widehat{\Theta} - \Theta^*\|_F \le 2\tau\sqrt{r_1}/\kappa(\mathfrak{X}) + 6\lambda\sqrt{r_2}/\kappa(\mathfrak{X}).$$

We can see that, provided that $r_1 > 0$, the rate of the proposed estimator is faster than the nuclear norm based
one, i.e., $O(\lambda\sqrt{r}/\kappa(\mathfrak{X}))$ (Negahban and Wainwright, 2011), in light of the fact that $\tau = \|\Pi_{\mathcal{F}_{S_1}}(\nabla\mathcal{L}_n(\Theta^*))\|_2$ is
an order of magnitude smaller than $\|\nabla\mathcal{L}_n(\Theta^*)\|_2 \asymp \lambda$. This will be demonstrated in more detail for specific
examples in Section 3.2. In particular, if $\gamma_i^* \ge \nu$ for every nonzero singular value, meaning that all the nonzero singular values are at least
$\nu$, the proposed estimator attains the best-case convergence rate of $2\tau\sqrt{r}/\kappa(\mathfrak{X})$.

In Theorem 3.4, we have shown that the convergence rate of the nonconvex penalty based estimator is
faster than the nuclear norm based one. In the following, we show that under certain assumptions on the
magnitude of the singular values, the estimator in (2.2) enjoys the oracle property, namely, the obtained
M-estimator performs as well as if the underlying model were known beforehand. Before presenting the
results on the oracle property, we first formally introduce the oracle estimator,

$$\widehat{\Theta}_O = \operatorname*{argmin}_{\Theta \in \mathcal{F}(\mathbf{U}^*, \mathbf{V}^*)} \mathcal{L}_n(\Theta). \tag{3.2}$$

Remark that the objective function in (3.2) only includes the empirical loss term because the optimization
program is constrained to the rank-$r$ subspace $\mathcal{F}(\mathbf{U}^*, \mathbf{V}^*)$. Since it is impossible to get $\mathbf{U}^*$, $\mathbf{V}^*$ and the rank
$r$ in practice, i.e., $\mathcal{F}(\mathbf{U}^*, \mathbf{V}^*)$ is unknown, the oracle estimator defined above is not a practical estimator. We
analyze the estimator in (2.2) when $\kappa(\mathfrak{X}) > \zeta_-$, under which condition $\mathcal{L}_{n,\lambda}(\Theta) = \mathcal{L}_n(\Theta) + \mathcal{P}_\lambda(\Theta)$ is strongly
convex over the restricted set $\mathcal{C}$ and $\widehat{\Theta}$ is the unique global optimal solution for the optimization problem.
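Moreover, the following theorem shows that under suitable conditions, the estimator in (2.2) is identical to
the oracle estimator.

Theorem 3.5 (Oracle Property). Under Assumptions 3.1 and 3.2, suppose that $\widehat{\Delta} = \widehat{\Theta} - \Theta^* \in \mathcal{C}$ and
$\mathcal{P}_\lambda(\Theta) = \sum_{i=1}^m p_\lambda(\gamma_i(\Theta))$ satisfies regularity conditions (i), (ii), (iii) in Assumption 3.3. If $\kappa(\mathfrak{X}) > \zeta_-$ and $\gamma^*$
satisfies the condition that

$$\min_{i \in S} |(\gamma^*)_i| \ge \nu + \frac{2\sqrt{r}\,\|\mathfrak{X}^*(\boldsymbol{\epsilon})\|_2}{n\,\kappa(\mathfrak{X})}, \tag{3.3}$$

where $S = \mathrm{supp}(\gamma^*)$, then for the estimator in (2.2) with the choice of regularization parameter $\lambda \ge 2n^{-1}\|\mathfrak{X}^*(\boldsymbol{\epsilon})\|_2 +
2n^{-1}\sqrt{r}\,\rho(\mathfrak{X})\|\mathfrak{X}^*(\boldsymbol{\epsilon})\|_2/\kappa(\mathfrak{X})$, we have that $\widehat{\Theta} = \widehat{\Theta}_O$, indicating $\mathrm{rank}(\widehat{\Theta}) = \mathrm{rank}(\widehat{\Theta}_O) = \mathrm{rank}(\Theta^*) = r$.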

Moreover, we have. 




where $\tau = \|\Pi_{\mathcal{F}}(\nabla\mathcal{L}_n(\Theta^*))\|_2$.

Theorem 3.5 implies that, with a suitable choice of regularization parameter $\lambda$, if the magnitude of the
smallest nonzero singular value is sufficiently large, i.e., satisfying (3.3), the proposed estimator in (2.2) is
identical to the oracle estimator. This is a very strong result because we do not even know the subspace
$\mathcal{F}$. The direct consequence is that the obtained M-estimator exactly recovers the rank of the true matrix,
$\Theta^*$. Moreover, as Theorem 3.5 is a specific case of Theorem 3.4 with $r_1 = r$, we immediately have that the
convergence rate in Theorem 3.5 corresponds to the best-case convergence rate in (3.1), which is identical to
the statistical rate of the oracle estimator.


3.2 Results for Specific Examples 

The deterministic results in Theorem 3.4 and Theorem 3.5 are fairly abstract in nature. In what follows, we 
consider the two specific examples of low-rank matrix estimation as in Section 2.2, and show how the results 
obtained so far yield concrete and interpretable results. More importantly, we rigorously demonstrate the 
improvement of the proposed estimator on statistical convergence rate over the traditional one with nuclear 
norm penalty. More results on oracle property can be found in Appendix, Section C. 


3.2.1 Matrix Completion 

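We first analyze the example of matrix completion, as discussed earlier in Example 2.1. It is worth noting
that under a suitable condition on the spikiness ratio¹, we can establish restricted strong convexity, as
stated in Assumption 3.1.

Corollary 3.6. Suppose that $\widehat{\Delta} = \widehat{\Theta} - \Theta^* \in \mathcal{C}$, the nonconvex penalty $\mathcal{P}_\lambda(\Theta)$ satisfies Assumption 3.3,
and $\Theta^*$ satisfies the spikiness assumption, i.e., $\|\Theta^*\|_\infty \le \alpha^*$. Then for any optimal solution $\widehat{\Theta}$ to the slight
modification of (2.2), i.e.,

$$\widehat{\Theta} = \operatorname*{argmin}_{\Theta \in \mathbb{R}^{m_1 \times m_2}} \frac{1}{2n}\|\mathbf{y} - \mathfrak{X}(\Theta)\|_2^2 + \mathcal{P}_\lambda(\Theta), \quad \text{subject to } \|\Theta\|_\infty \le \alpha^*,$$

there are universal constants $C_1, \ldots, C_5$ such that, with regularization parameter $\lambda \ge C_3\sigma\sqrt{\log M/(nm)}$ and $\kappa =
C_4/(m_1 m_2) > \zeta_-$, it holds with probability at least $1 - C_5/M$ that

$$\frac{1}{\sqrt{m_1 m_2}}\|\widehat{\Theta} - \Theta^*\|_F \le \max\{\alpha^*, \sigma\}\left[C_1 r_1 \sqrt{\frac{\log M}{n}} + C_2\sqrt{\frac{r_2 M \log M}{n}}\right].$$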

¹The low-rank assumption alone is insufficient for recovery, because overly “spiky” matrices, which have very
few large entries, cannot be recovered. An additional assumption on the spikiness ratio is needed. Details on spikiness are given in Appendix, Section C.1.

Remark 3.7. Corollary 3.6 is a direct result of Theorem 3.4. Recall the convergence rate² of matrix
completion with nuclear norm penalty due to Koltchinskii et al. (2011a); Negahban and Wainwright (2012),
which is as follows

$$\frac{1}{\sqrt{m_1 m_2}}\|\widehat{\Theta} - \Theta^*\|_F = O\left(\max\{\alpha^*, \sigma\}\sqrt{\frac{rM\log M}{n}}\right). \tag{3.4}$$

It is evident that if $r_1 > 0$, i.e., there are $r_1$ singular values of $\Theta^*$ larger than $\nu$, the convergence rate
obtained by a nonconvex penalty is faster than the one obtained with the convex penalty. In the worst case,
when all the singular values are smaller than $\nu$, our result reduces to (3.4) with $r_2 = r$. Meanwhile, if the
magnitude of the singular values satisfies the condition that $\min_{i \in S}\gamma_i^* \ge \nu$, i.e., $r_1 = r$, $S_1 = S$, the convergence
rate of our results is $O(r\sqrt{\log M/n})$. In Koltchinskii et al. (2011a); Negahban and Wainwright (2012),
the authors proved a minimax lower bound for matrix completion, which is $O(\sqrt{rM/n})$. Our result is not
contradictory to the minimax lower bound because the lower bound is proved for the general class of low-rank
matrices, while our results take advantage of the large singular values. In other words, we consider a
specific (potentially smaller) class of low-rank matrices with both large and small singular values.

3.2.2 Matrix Sensing With Dependent Sampling 

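In the example of matrix sensing, a more general model with dependence among the entries of $\mathbf{X}_i$ is
considered. Denote by $\mathrm{vec}(\mathbf{X}_i) \in \mathbb{R}^{m_1 m_2}$ the vectorization of $\mathbf{X}_i$. For a symmetric positive definite matrix
$\Sigma \in \mathbb{R}^{m_1 m_2 \times m_1 m_2}$, the design is called the $\Sigma$-ensemble (Negahban and Wainwright, 2011) if the elements of the observation
matrices $\mathbf{X}_i$ are sampled from $\mathrm{vec}(\mathbf{X}_i) \sim N(0, \Sigma)$. Define $\pi^2(\Sigma) = \sup_{\|\mathbf{u}\|_2 = 1, \|\mathbf{v}\|_2 = 1} \mathrm{Var}(\mathbf{u}^\top \mathbf{X} \mathbf{v})$, where
$\mathbf{X} \in \mathbb{R}^{m_1 \times m_2}$ is a random matrix sampled from the $\Sigma$-ensemble. Specifically, when $\Sigma = \mathbf{I}$, it can be verified
that $\pi(\mathbf{I}) = 1$, corresponding to the classical matrix sensing model where the entries of $\mathbf{X}_i$ are independent
of each other.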
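A sketch of sampling from the $\Sigma$-ensemble is given below. It assumes that $\mathrm{vec}(\cdot)$ stacks columns (the text does not state the convention) and uses a Cholesky factor to impose the covariance; the function names are hypothetical.

```python
import numpy as np

def sample_sigma_ensemble(n, m1, m2, Sigma, rng):
    """Draw n matrices X_i with vec(X_i) ~ N(0, Sigma), column-stacking vec."""
    L = np.linalg.cholesky(Sigma)                    # Sigma = L @ L.T
    vecs = rng.standard_normal((n, m1 * m2)) @ L.T   # each row is N(0, Sigma)
    # Undo column stacking: vec index k * m1 + j corresponds to entry (j, k).
    return vecs.reshape(n, m2, m1).transpose(0, 2, 1)

def projected_variance(X, u, v):
    """Empirical variance of u^T X_i v over the sampled matrices."""
    return float(np.var(np.einsum("j,ijk,k->i", u, X, v)))
```

With $\Sigma = \mathbf{I}$, `projected_variance` is close to 1 for any unit vectors `u` and `v`, in line with $\pi(\mathbf{I}) = 1$.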


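Corollary 3.8. Suppose that $\widehat{\Delta} = \widehat{\Theta} - \Theta^* \in \mathcal{C}$ and the nonconvex penalty $\mathcal{P}_\lambda(\Theta)$ satisfies Assumption 3.3.
If the random design matrix $\mathbf{X}_i \in \mathbb{R}^{m_1 \times m_2}$ is sampled from the $\Sigma$-ensemble and $\lambda_{\min}(\Sigma)$ is the minimal
eigenvalue of $\Sigma$, there are universal constants $C_1, \ldots, C_6$ such that, if $\kappa(\mathfrak{X}) = C_3\lambda_{\min}(\Sigma) > \zeta_-$, for any
optimal solution $\widehat{\Theta}$ of (2.2) with $\lambda \ge C_4\sigma\pi(\Sigma)(\sqrt{m_1/n} + \sqrt{m_2/n})$, it holds with probability at least
$1 - C_5\exp(-C_6(m_1 + m_2))$ that

$$\|\widehat{\Theta} - \Theta^*\|_F \le \frac{\sigma\pi(\Sigma)}{\sqrt{n}\,\lambda_{\min}(\Sigma)}\left[C_1 r_1 + C_2\sqrt{r_2 M}\right].$$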

Remark 3.9. Similarly, Corollary 3.8 is a direct consequence of Theorem 3.4. The problem has been studied
by Negahban and Wainwright (2011) via convex relaxation, with the following estimation error bound

$$\|\widehat{\Theta} - \Theta^*\|_F = O\left(\frac{\sigma\pi(\Sigma)\sqrt{rM}}{\sqrt{n}\,\lambda_{\min}(\Sigma)}\right). \tag{3.5}$$

When there are $r_1 > 0$ singular values that are larger than $\nu$, the result obtained in Corollary 3.8 implies that
the convergence rate of the proposed estimator is faster than (3.5). When $r_1 = r$, we obtain the best-case
convergence rate of $\|\widehat{\Theta} - \Theta^*\|_F = O(\sigma\pi(\Sigma)\,r/(\sqrt{n}\,\lambda_{\min}(\Sigma)))$. In the worst case, when $r_1 = 0$, $r_2 = r$, the
result in Corollary 3.8 reduces to (3.5).

²A similar statistical convergence rate was obtained in Negahban and Wainwright (2012) for a nonuniform sampling scheme.

Figure 1: Simulation Results for Matrix Completion and Matrix Sensing. The size of the matrix is $m \times m$,
and the rank is $r = \lfloor \log^2 m \rfloor$. Figure 1(a)-1(c) correspond to matrix completion, where the rescaled sample
size is $N = n/(rm\log m)$. Figure 1(d)-1(f) correspond to matrix sensing, where the rescaled sample size is
$N = n/(rm)$.


4 Numerical Experiments 


In this section, we study the performance of the proposed estimator by various simulations and numerical
experiments on real-world datasets. It is worth noting that we study the proposed estimator with $\zeta_- < \kappa(\mathfrak{X})$,
which can be attained by setting $b = 1 + 2/\kappa(\mathfrak{X})$ for the SCAD penalty.
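The effect of this choice of $b$ can be traced through in one line, using $\zeta_- = 1/(b-1)$ for SCAD from Assumption 3.3. The derivation below is a worked check of the constants, under the setting $b = 1 + 2/\kappa(\mathfrak{X})$ stated above.

```latex
\zeta_- = \frac{1}{b - 1}
        = \frac{1}{\bigl(1 + 2/\kappa(\mathfrak{X})\bigr) - 1}
        = \frac{\kappa(\mathfrak{X})}{2} < \kappa(\mathfrak{X}),
\qquad
\frac{1}{\kappa(\mathfrak{X}) - \zeta_-} = \frac{2}{\kappa(\mathfrak{X})}.
```

Substituting the second identity into (3.1) gives the factors $2\tau\sqrt{r_1}/\kappa(\mathfrak{X})$ and $6\lambda\sqrt{r_2}/\kappa(\mathfrak{X})$ that appear in the simplified bound after Theorem 3.4.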


4.1 Simulations 

The simulation results demonstrate the close agreement between the theoretical upper bound and the numerical
behavior of the M-estimator. Simulations are performed for both matrix completion and matrix sensing. In
both cases, we solved instances of optimization problem (2.2) for a square matrix $\Theta^* \in \mathbb{R}^{m \times m}$. For $\Theta^*$ with
rank $r$, we generate $\Theta^* = \mathbf{A}\mathbf{B}\mathbf{C}^\top$, where $\mathbf{A}, \mathbf{C} \in \mathbb{R}^{m \times m}$ are the left and right singular vectors of a random
matrix, and set $\mathbf{B}$ to be a diagonal matrix with $r$ nonzero entries, where the magnitude of each nonzero entry
is above $\nu = \lambda b$, i.e., $r_1 = r$. The regularization parameter $\lambda$ is chosen based on the theoretical results with $\sigma^2$
assumed to be known.
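The construction of $\Theta^*$ can be sketched as follows. The text does not specify how the nonzero diagonal entries are drawn, so the uniform offset above $\nu$ controlled by the hypothetical `gap` parameter is an assumption made only for illustration.

```python
import numpy as np

def make_low_rank(m, r, nu, rng, gap=1.0):
    """Build Theta* = A B C^T with r nonzero singular values, all above nu.

    A and C hold the singular vectors of a random Gaussian matrix; the way the
    diagonal of B is drawn (nu plus a positive offset) is an assumption.
    """
    G = rng.standard_normal((m, m))
    A, _, Ct = np.linalg.svd(G)                 # Ct is C^T
    diag = np.zeros(m)
    diag[:r] = nu + gap * (1.0 + rng.random(r))  # strictly larger than nu
    return A @ np.diag(diag) @ Ct
```

Because `A` and `Ct` are orthogonal, the singular values of the result are exactly the diagonal entries, so the rank is `r` and every nonzero singular value exceeds `nu`.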

In the following, we report detailed results on the estimation errors of the obtained estimators and the 
probability of exactly recovering the true rank (oracle property). 

Matrix Completion. We study the performance of estimators with both convex and nonconvex penalties
for $m \in \{40, 60, 80\}$, and the rank $r = \lfloor \log^2 m \rfloor$. The $\mathbf{X}_i$'s are uniformly sampled over $\mathcal{X}$, with the variance of the
observation noise $\sigma^2 = 0.25$. For every configuration, we repeat 100 trials and compute the averaged mean
squared Frobenius norm error $\|\widehat{\Theta} - \Theta^*\|_F^2/m^2$ over all trials.

Figure 1(a)-1(c) summarize the results for matrix completion. Particularly, Figure 1(a) plots the mean-squared
Frobenius norm error versus the raw sample size, which shows the consistency that the estimation
error decreases when the sample size increases, while Figure 1(b) plots the MSE against the rescaled sample
size $N = n/(rm\log m)$. It is clearly shown in Figure 1(b) that, in terms of estimation error, the proposed
estimator with SCAD penalty outperforms the one with nuclear norm, which aligns with our theoretical
analysis. Finally, the probability of exactly recovering the rank of the underlying matrix is plotted in Figure 1(c),
which indicates that with high probability the rank of the underlying matrix can be exactly recovered.

Table 1: Results on image recovery in terms of RMSE (×10 mean ± std).

Image     SVP           SoftImpute    AltMin        TNC           R1MP          Nuclear       SCAD
Lenna     3.84 ± 0.02   4.58 ± 0.02   4.43 ± 0.11   5.49 ± 0.62   3.91 ± 0.03   5.05 ± 0.17   2.81 ± 0.02
Barbara   4.49 ± 0.04   5.23 ± 0.03   5.05 ± 0.05   6.57 ± 0.92   4.71 ± 0.06   6.48 ± 0.53   4.75 ± 0.02
Clown     3.75 ± 0.03   4.43 ± 0.05   5.44 ± 0.41   6.92 ± 1.89   3.89 ± 0.05   3.70 ± 0.24   2.82 ± 0.01
Crowd     4.49 ± 0.04   5.35 ± 0.07   4.78 ± 0.09   7.44 ± 1.23   4.88 ± 0.06   4.44 ± 0.18   3.67 ± 0.07
Girl      3.35 ± 0.03   4.12 ± 0.03   5.01 ± 0.66   4.51 ± 0.52   3.06 ± 0.02   4.77 ± 0.34   2.06 ± 0.01
Man       4.42 ± 0.04   5.17 ± 0.03   5.17 ± 0.17   6.01 ± 0.62   4.61 ± 0.03   5.44 ± 0.45   3.41 ± 0.03

Table 2: Recommendation results measured in terms of the averaged RMSE.

Dataset   SVP      SoftImpute   AltMin   TNC      R1MP     Nuclear   SCAD
Jester1   4.7318   5.1211       4.8562   4.4803   4.3401   4.6910    4.1733
Jester2   4.7712   5.1523       4.8712   4.4511   4.3721   4.5597    4.2016
Jester3   8.7439   5.4532       9.5230   4.6712   4.9803   5.1231    4.6777

Matrix Sensing. For matrix sensing, we set the rank $r = 10$ for all $m \in \{20, 40, 80\}$. $\Theta^*$ is generated
similarly as in matrix completion. We set the observation noise variance $\sigma^2 = 1$ and $\Sigma = \mathbf{I}$, i.e., the entries of
$\mathbf{X}_i$ are independent. Each setting is repeated 100 times.

Figure 1(d)-1(f) correspond to the results of matrix sensing. The Frobenius norm error $\|\widehat{\Theta} - \Theta^*\|_F$ is reported in
log scale. Figure 1(d) demonstrates how the estimation errors scale with $m$ and $n$, which aligns well with our
theory. Also, as observed in Figure 1(e), the estimator with SCAD penalty has lower estimation error compared
with the one with nuclear norm penalty. Finally, Figure 1(f) shows that, empirically, the underlying rank
is perfectly recovered by the nonconvex estimator when $n$ is sufficiently large ($n \ge 5rm$).


4.2 Experiments on Real World Datasets 

In this section, we apply our proposed matrix completion estimator to two real-world applications, image
inpainting and collaborative filtering, and compare it with some existing methods, including singular value
projection (SVP) (Jain et al., 2010), Trace Norm Constraint (TNC) (Jaggi and Sulovsky, 2010), alternating
minimization (AltMin) (Jain et al., 2013), spectral regularization algorithm (SoftImpute) (Mazumder et al.,
2010), rank-one matrix pursuit (R1MP) (Wang et al., 2014), and nuclear norm penalty (Negahban and
Wainwright, 2011).

Image Inpainting. We select 6 images³ to test the performance of different algorithms. The matrices
corresponding to the selected images are of size 512 × 512. We project the underlying matrices onto the
subspaces associated with the top $r = 200$ singular values of each matrix, which
guarantees that the problem being solved is a low-rank one. In addition, we randomly select 50% of the entries
as observations. Each trial is repeated 10 times. The performance is measured by root mean square error
(RMSE) (Jaggi and Sulovsky, 2010; Shalev-Shwartz et al., 2011), summarized in Table 1. As shown in Table 1,
the estimator obtained with SCAD penalty achieves the best performance, and significantly outperforms the
other algorithms on all pictures except Barbara. Moreover, the estimator with SCAD penalty has smaller
RMSE than the nuclear norm based estimator on all pictures, which backs up our theoretical
analysis, and the improvement is significant compared with some specific algorithms.

³The images can be downloaded from http://www.utdallas.edu/~cxc123730/mh_bcs_spl.html.
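The preprocessing and the metric described for image inpainting can be sketched as below. The function names are illustrative, and computing the RMSE over all entries is an assumption, since the text does not state which entries enter the average.

```python
import numpy as np

def truncate_rank(img, r):
    """Keep only the top-r singular values of an image matrix."""
    U, s, Vt = np.linalg.svd(img, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r]

def observe_mask(shape, frac, rng):
    """Boolean mask that marks each entry as observed with probability frac."""
    return rng.random(shape) < frac

def rmse(estimate, truth):
    """Root mean square error over all entries (an assumption, see above)."""
    return float(np.sqrt(np.mean((estimate - truth) ** 2)))
```

In the setting of the text, one would call `truncate_rank(img, 200)` on a 512 × 512 image and `observe_mask(img.shape, 0.5, rng)` to select roughly half of the entries as observations.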

Collaborative Filtering. Considering the matrix completion algorithms for recommendations, we demonstrate
using 3 datasets: Jester1⁴, Jester2 and Jester3, which contain rating data of users on jokes, with
real-valued rating scores ranging from −10.0 to 10.0. The sizes of these matrices are {24983, 23500, 24983} × 100,
containing $10^6$, $10^6$, $6 \times 10^5$ ratings, respectively. We randomly select 50% of the ratings as observations, and
make predictions over the remaining 50%. Each run is repeated 10 times. According to the numerical
results summarized in Table 2, we observe that the proposed estimator (SCAD) has the lowest RMSE on
Jester1 and Jester2, and is within 0.01 of the best method (TNC) on Jester3. In particular, the estimator with SCAD penalty is better than the estimator
with nuclear norm penalty on all three datasets, which agrees well with the theoretical results obtained.


5 Conclusions 

In this paper, we proposed a unified framework for low-rank matrix estimation with nonconvex penalties for 
a generic observation model. Our work serves as the bridge to connect practical applications of nonconvex 
penalties and theoretical analysis. Our theoretical results indicate that the convergence rate of estimators 
with nonconvex penalties is faster than the one with the convex penalty by taking advantage of the large 
singular values. In addition, we showed that the proposed estimator enjoys the oracle property when a mild 
condition on the magnitude of singular values is imposed. Extensive experiments demonstrate the close 
agreement between theoretical analysis and numerical behavior of the proposed estimator.
A Background

For matrix 0* G which is exactly low-rank and has rank r, we have the singular value decomposition 

(SVD) form of ©* = u*r*V*^, where U* € V* S are matrices consist of left and right 

singular vectors, and F* = diag( 7 {,... , 7 *) G Based on U*, V*, we define the following two subspaces 

of 

^■(U*, V*) := {A|row(A) C V* and col(A) C U*}, 


and 


V*) := {A|row(A) _L V* and col(A) _L U*}, 

where A G jg arbitrary matrix, and row(A) C ]R™ 2 ^ col(A) C K™! are the row space and column 

space of the matrix A, respectively. We will use the shorthand notations of T and whenever (U*, V*) 

“^The Jester dataset can be downloaded from http://eigentaste.berkeley.edu/dataset/. 
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are clear from the context. Define njr,njri as the projection operators onto the subspaces J- and 

n^(A) = u*u*^av*v*t, 

n^r(A) = -U*U*^)A(I„, - v*v*^). 

Thus, every $\Delta \in \mathbb{R}^{m_1 \times m_2}$ has an orthogonal complement $\Delta''$ with respect to the true low-rank matrix $\Theta^*$, as follows:

\[ \Delta'' = (I_{m_1} - U^* U^{*\top}) \Delta (I_{m_2} - V^* V^{*\top}), \qquad \Delta' = \Delta - \Delta'', \qquad \text{(A.1)} \]

where $\Delta'$ is the component which has overlapped row and column space with $\Theta^*$. Negahban et al. (2012) give a detailed discussion of the concept of decomposability and a large class of decomposable norms, among which the decomposability of the nuclear norm and Frobenius norm is relevant to our problem. For low-rank estimation, we have the equality $\|\Theta^* + \Delta''\|_* = \|\Theta^*\|_* + \|\Delta''\|_*$ with $\Delta''$ defined in (A.1).
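
The decomposition in (A.1) and the nuclear-norm equality can be checked numerically. The following is a minimal sketch, assuming NumPy and a randomly generated rank-$r$ matrix; the sizes, the seed, and the helper names `proj_F`, `proj_F_perp`, and `nuclear_norm` are illustrative choices, not part of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
m1, m2, r = 6, 5, 2  # illustrative sizes (assumption)

# Build a rank-r matrix Theta* and extract its singular vectors U*, V*.
theta_star = rng.standard_normal((m1, r)) @ rng.standard_normal((r, m2))
U_full, _, Vt_full = np.linalg.svd(theta_star, full_matrices=False)
U, V = U_full[:, :r], Vt_full[:r, :].T

P_U = U @ U.T  # orthogonal projector onto col(Theta*)
P_V = V @ V.T  # orthogonal projector onto row(Theta*)


def proj_F(A):
    """Projection onto F(U*, V*)."""
    return P_U @ A @ P_V


def proj_F_perp(A):
    """Projection onto F^perp(U*, V*)."""
    return (np.eye(m1) - P_U) @ A @ (np.eye(m2) - P_V)


def nuclear_norm(A):
    """Sum of singular values."""
    return np.linalg.svd(A, compute_uv=False).sum()


delta = rng.standard_normal((m1, m2))
delta_pp = proj_F_perp(delta)  # Delta'' in (A.1)
delta_p = delta - delta_pp     # Delta'  in (A.1)

lhs = nuclear_norm(theta_star + delta_pp)
rhs = nuclear_norm(theta_star) + nuclear_norm(delta_pp)
print(abs(lhs - rhs))
```

The equality holds because $\Theta^*$ and $\Delta''$ have mutually orthogonal row spaces and column spaces, so the singular values of their sum are the union of their individual singular values.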

B Proof of the Main Results 

B.l Proof of Theorem 3.4 

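We first define $\widetilde{\mathcal{L}}_{n,\lambda}(\cdot)$ as follows,

\[ \widetilde{\mathcal{L}}_{n,\lambda}(\Theta) = \mathcal{L}_n(\Theta) + \mathcal{Q}_\lambda(\Theta). \]

Based on the restricted strong convexity of $\mathcal{L}_n$ and the curvature parameter of the nonconvex penalty, if $\kappa(\mathfrak{X}) > \zeta_-$, we have the restricted strong convexity of $\widetilde{\mathcal{L}}_{n,\lambda}$, as stated in the following lemma.

Lemma B.1. Under Assumption 3.1, if it is assumed that $\Theta_1 - \Theta_2 \in \mathcal{C}$, we have

\[ \widetilde{\mathcal{L}}_{n,\lambda}(\Theta_2) \ge \widetilde{\mathcal{L}}_{n,\lambda}(\Theta_1) + \langle \nabla \widetilde{\mathcal{L}}_{n,\lambda}(\Theta_1), \Theta_2 - \Theta_1 \rangle + \frac{\kappa(\mathfrak{X}) - \zeta_-}{2} \|\Theta_2 - \Theta_1\|_F^2. \]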

Proof. Proof is provided in Section D.l. □ 

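In the following, we prove that $\hat\Delta = \hat\Theta - \Theta^*$ lies in the cone $\mathcal{C}$, where

\[ \mathcal{C} = \{\Delta \in \mathbb{R}^{m_1 \times m_2} \mid \|\Pi_{\mathcal{F}^{\perp}}(\Delta)\|_* \le 5 \|\Pi_{\mathcal{F}}(\Delta)\|_*\}. \]

Lemma B.2. Under Assumption 3.1, the condition $\kappa(\mathfrak{X}) > \zeta_-$, and the regularization parameter $\lambda \ge 2\|\mathfrak{X}^*(\epsilon)\|_2 / n$, we have

\[ \|\Pi_{\mathcal{F}^{\perp}}(\hat\Theta - \Theta^*)\|_* \le 5 \|\Pi_{\mathcal{F}}(\hat\Theta - \Theta^*)\|_*. \]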

Proof. Proof is provided in Section D.2. □ 

Now we are ready to prove Theorem 3.4. 

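Proof of Theorem 3.4. According to Lemma B.1, we have

\[ \widetilde{\mathcal{L}}_{n,\lambda}(\hat\Theta) \ge \widetilde{\mathcal{L}}_{n,\lambda}(\Theta^*) + \langle \nabla \widetilde{\mathcal{L}}_{n,\lambda}(\Theta^*), \hat\Theta - \Theta^* \rangle + \frac{\kappa(\mathfrak{X}) - \zeta_-}{2} \|\hat\Theta - \Theta^*\|_F^2, \qquad \text{(B.1)} \]

\[ \widetilde{\mathcal{L}}_{n,\lambda}(\Theta^*) \ge \widetilde{\mathcal{L}}_{n,\lambda}(\hat\Theta) + \langle \nabla \widetilde{\mathcal{L}}_{n,\lambda}(\hat\Theta), \Theta^* - \hat\Theta \rangle + \frac{\kappa(\mathfrak{X}) - \zeta_-}{2} \|\Theta^* - \hat\Theta\|_F^2. \qquad \text{(B.2)} \]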
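Meanwhile, since $\|\cdot\|_*$ is convex, for subgradients $W^* \in \partial\|\Theta^*\|_*$ and $\hat W \in \partial\|\hat\Theta\|_*$ we have

\[ \lambda\|\hat\Theta\|_* \ge \lambda\|\Theta^*\|_* + \lambda\langle \hat\Theta - \Theta^*, W^* \rangle, \qquad \text{(B.3)} \]

\[ \lambda\|\Theta^*\|_* \ge \lambda\|\hat\Theta\|_* + \lambda\langle \Theta^* - \hat\Theta, \hat W \rangle. \qquad \text{(B.4)} \]

Adding (B.1) to (B.4), we have

\[ 0 \ge \langle \nabla \widetilde{\mathcal{L}}_{n,\lambda}(\Theta^*) + \lambda W^*, \hat\Theta - \Theta^* \rangle + \langle \nabla \widetilde{\mathcal{L}}_{n,\lambda}(\hat\Theta) + \lambda \hat W, \Theta^* - \hat\Theta \rangle + (\kappa(\mathfrak{X}) - \zeta_-) \|\hat\Theta - \Theta^*\|_F^2. \]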

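Since $\hat\Theta$ is the solution to the SDP (2.2), $\hat\Theta$ satisfies the optimality condition (variational inequality): for any $\Theta' \in \mathbb{R}^{m_1 \times m_2}$, it holds that

\[ \max_{\Theta'} \langle \nabla \widetilde{\mathcal{L}}_{n,\lambda}(\hat\Theta) + \lambda \hat W, \hat\Theta - \Theta' \rangle \le 0, \]

which implies

\[ \langle \nabla \widetilde{\mathcal{L}}_{n,\lambda}(\hat\Theta) + \lambda \hat W, \Theta^* - \hat\Theta \rangle \ge 0. \]

Hence,

\[ (\kappa(\mathfrak{X}) - \zeta_-) \|\hat\Theta - \Theta^*\|_F^2 \le \langle \nabla \widetilde{\mathcal{L}}_{n,\lambda}(\Theta^*) + \lambda W^*, \Theta^* - \hat\Theta \rangle \]
\[ \le \langle \Pi_{\mathcal{F}^{\perp}}(\nabla \widetilde{\mathcal{L}}_{n,\lambda}(\Theta^*) + \lambda W^*), \Theta^* - \hat\Theta \rangle + \langle \Pi_{\mathcal{F}}(\nabla \widetilde{\mathcal{L}}_{n,\lambda}(\Theta^*) + \lambda W^*), \Theta^* - \hat\Theta \rangle. \qquad \text{(B.5)} \]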

Recall that $\gamma^* = \gamma(\Theta^*)$ is the vector of (ordered) singular values of $\Theta^*$. In the following, we decompose (B.5) into three parts with regard to the magnitude of the singular values of $\Theta^*$:

(1) $i \in S^c$ such that $(\gamma^*)_i = 0$;

(2) $i \in S_1$ such that $(\gamma^*)_i \ge \nu$;

(3) $i \in S_2$ such that $\nu > (\gamma^*)_i > 0$.

Note that $S_1 \cup S_2 = S$.

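(1) For $i \in S^c$, it corresponds to the projector $\Pi_{\mathcal{F}^{\perp}}(\cdot)$, since $\gamma(\Pi_{\mathcal{F}^{\perp}}(\Theta^*)) = (\gamma^*)_{S^c} = 0$.

Based on the regularity condition (iii) in Assumption 3.3 that $q'_\lambda(0) = 0$, we have that $\nabla \mathcal{Q}_\lambda(\Theta^*) = U^* q'_\lambda(\Gamma^*) V^{*\top}$, where $\Gamma^* \in \mathbb{R}^{r \times r}$ is the diagonal matrix with $\mathrm{diag}(\Gamma^*) = \gamma^*$. Hence

\[ \Pi_{\mathcal{F}^{\perp}}(\nabla \mathcal{Q}_\lambda(\Theta^*)) = (I_{m_1} - U^* U^{*\top}) U^* q'_\lambda(\Gamma^*) V^{*\top} (I_{m_2} - V^* V^{*\top}) \]
\[ = (U^* - U^*) q'_\lambda(\Gamma^*) (V^{*\top} - V^{*\top}) \]
\[ = 0. \]

Meanwhile, we have

\[ \|\Pi_{\mathcal{F}^{\perp}}(\nabla \mathcal{L}_n(\Theta^*))\|_2 \le \|\nabla \mathcal{L}_n(\Theta^*)\|_2 = \|\mathfrak{X}^*(\epsilon)\|_2 / n \le \lambda. \]

For $Z^* = -\lambda^{-1} \Pi_{\mathcal{F}^{\perp}}(\nabla \mathcal{L}_n(\Theta^*))$, we have $W^* = U^* V^{*\top} + Z^* \in \partial\|\Theta^*\|_*$ because $\|Z^*\|_2 \le 1$ and $Z^* \in \mathcal{F}^{\perp}$, which satisfies the condition for $W^*$ to be a subgradient of $\|\Theta^*\|_*$. With this particular choice of $W^*$, we have

\[ \Pi_{\mathcal{F}^{\perp}}(\nabla \mathcal{L}_n(\Theta^*) + \lambda W^*) = \Pi_{\mathcal{F}^{\perp}}(\nabla \mathcal{L}_n(\Theta^*)) + \lambda Z^* = 0, \]

which implies that

\[ \langle \Pi_{\mathcal{F}^{\perp}}(\nabla \widetilde{\mathcal{L}}_{n,\lambda}(\Theta^*) + \lambda W^*), \Theta^* - \hat\Theta \rangle = \langle 0, \Theta^* - \hat\Theta \rangle = 0. \qquad \text{(B.6)} \]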

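(2) Consider $i \in S_1$ such that $(\gamma^*)_i \ge \nu$. Let $|S_1| = r_1$. Define a subspace of $\mathcal{F}$ associated with $S_1$ as follows:

\[ \mathcal{F}_{S_1}(U^*, V^*) := \{\Delta \in \mathbb{R}^{m_1 \times m_2} \mid \mathrm{row}(\Delta) \subseteq V^*_{S_1} \text{ and } \mathrm{col}(\Delta) \subseteq U^*_{S_1}\}, \]

where $U^*_{S_1}$ and $V^*_{S_1}$ are the matrices formed by the $i$-th columns (singular vectors) of $U^*$ and $V^*$ with $i \in S_1$.

Recall that $\mathcal{P}_\lambda(\Theta^*) = \mathcal{Q}_\lambda(\Theta^*) + \lambda\|\Theta^*\|_*$. We have

\[ \nabla \mathcal{P}_\lambda(\Theta^*) = \nabla \mathcal{Q}_\lambda(\Theta^*) + \lambda (U^* V^{*\top} + Z^*). \]

Projecting $\nabla \mathcal{P}_\lambda(\Theta^*)$ into the subspace $\mathcal{F}_{S_1}$, we have

\[ \Pi_{\mathcal{F}_{S_1}}(\nabla \mathcal{P}_\lambda(\Theta^*)) = \Pi_{\mathcal{F}_{S_1}}(\nabla \mathcal{Q}_\lambda(\Theta^*) + \lambda U^* V^{*\top} + \lambda Z^*) \]
\[ = U^*_{S_1} q'_\lambda(\Gamma^*_{S_1}) (V^*_{S_1})^\top + \lambda U^*_{S_1} (V^*_{S_1})^\top \]
\[ = U^*_{S_1} \big(q'_\lambda(\Gamma^*_{S_1}) + \lambda I_{S_1}\big) (V^*_{S_1})^\top, \]

where $\Gamma^*_{S_1} \in \mathbb{R}^{r \times r}$ and $q'_\lambda(\Gamma^*_{S_1}) + \lambda I_{S_1}$ is a diagonal matrix with $(q'_\lambda(\Gamma^*_{S_1}) + \lambda I_{S_1})_{ii} = 0$ for $i \notin S_1$, and for all $i \in S_1$,

\[ (q'_\lambda(\Gamma^*_{S_1}) + \lambda I_{S_1})_{ii} = q'_\lambda((\gamma^*)_i) + \lambda = p'_\lambda((\gamma^*)_i) = 0, \]

where the last equality holds because $p_\lambda(\cdot)$ satisfies the regularity condition (i) with $(\gamma^*)_i \ge \nu$ for $i \in S_1$. Thus, we have $q'_\lambda(\Gamma^*_{S_1}) + \lambda I_{S_1} = 0$, which indicates that $\Pi_{\mathcal{F}_{S_1}}(\nabla \mathcal{P}_\lambda(\Theta^*)) = 0$. Therefore, we have

\[ \langle \Pi_{\mathcal{F}_{S_1}}(\nabla \widetilde{\mathcal{L}}_{n,\lambda}(\Theta^*) + \lambda W^*), \Theta^* - \hat\Theta \rangle = \langle \Pi_{\mathcal{F}_{S_1}}(\nabla \mathcal{L}_n(\Theta^*) + \nabla \mathcal{P}_\lambda(\Theta^*)), \Theta^* - \hat\Theta \rangle \]
\[ = \langle \Pi_{\mathcal{F}_{S_1}}(\nabla \mathcal{L}_n(\Theta^*)), \Pi_{\mathcal{F}_{S_1}}(\Theta^* - \hat\Theta) \rangle \]
\[ \le \|\Pi_{\mathcal{F}_{S_1}}(\nabla \mathcal{L}_n(\Theta^*))\|_2 \cdot \|\Pi_{\mathcal{F}_{S_1}}(\Theta^* - \hat\Theta)\|_*, \]

where the last inequality is derived from the Hölder inequality. What remains is to bound $\|\Pi_{\mathcal{F}_{S_1}}(\Theta^* - \hat\Theta)\|_*$. By the properties of projection onto the subspace $\mathcal{F}_{S_1}$, we have

\[ \|\Pi_{\mathcal{F}_{S_1}}(\Theta^* - \hat\Theta)\|_* \le \sqrt{r_1}\, \|\Pi_{\mathcal{F}_{S_1}}(\Theta^* - \hat\Theta)\|_F \le \sqrt{r_1}\, \|\Theta^* - \hat\Theta\|_F, \]

where the first inequality is due to the fact that $\mathrm{rank}(\Pi_{\mathcal{F}_{S_1}}(\Theta^* - \hat\Theta)) \le r_1$, and the second holds because a projection does not increase the Frobenius norm. Therefore, we have

\[ \langle \Pi_{\mathcal{F}_{S_1}}(\nabla \widetilde{\mathcal{L}}_{n,\lambda}(\Theta^*) + \lambda W^*), \Theta^* - \hat\Theta \rangle \le \sqrt{r_1}\, \|\Pi_{\mathcal{F}_{S_1}}(\nabla \mathcal{L}_n(\Theta^*))\|_2 \cdot \|\Theta^* - \hat\Theta\|_F. \qquad \text{(B.7)} \]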
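
To make the cancellation $q'_\lambda((\gamma^*)_i) + \lambda = p'_\lambda((\gamma^*)_i) = 0$ on $S_1$ concrete, here is a minimal sketch using the minimax concave penalty (MCP) of Zhang (2010) as one example of a nonconvex penalty. The parametrization $p'_\lambda(t) = \max(\lambda - t/b, 0)$ for $t \ge 0$, with threshold $\nu = b\lambda$, is the standard MCP form; the values of `lam` and `b` and the sample singular values are hypothetical.

```python
import numpy as np


def mcp_derivative(t, lam, b):
    """p'_lambda(t) for t >= 0 under MCP: max(lam - t / b, 0)."""
    return np.maximum(lam - t / b, 0.0)


def q_derivative(t, lam, b):
    """q'_lambda(t) = p'_lambda(t) - lam: derivative of the concave part for t > 0."""
    return mcp_derivative(t, lam, b) - lam


lam, b = 0.5, 3.0   # hypothetical tuning parameters
nu = b * lam        # MCP is flat beyond this threshold

large = np.array([nu, 2.0, 5.0])   # singular values in S_1 (>= nu)
small = np.array([0.1, 0.7, 1.2])  # singular values in S_2 (< nu)

# Diagonal entries of q'(Gamma_{S_1}) + lam * I: these vanish on S_1.
diag_S1 = q_derivative(large, lam, b) + lam
# On S_2 the entries of q'(Gamma_{S_2}) are only bounded by lam in magnitude.
diag_S2 = q_derivative(small, lam, b)

print(diag_S1, diag_S2)
```

Large singular values therefore contribute no penalty gradient, which is why only the projected loss gradient appears in (B.7).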

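(3) Finally, consider $i \in S_2$ such that $(\gamma^*)_i < \nu$. Let $|S_2| = r_2$. Define a subspace of $\mathcal{F}$ associated with $S_2$ as follows:

\[ \mathcal{F}_{S_2}(U^*, V^*) := \{\Delta \in \mathbb{R}^{m_1 \times m_2} \mid \mathrm{row}(\Delta) \subseteq V^*_{S_2} \text{ and } \mathrm{col}(\Delta) \subseteq U^*_{S_2}\}, \]

where $U^*_{S_2}$ and $V^*_{S_2}$ are the matrices formed by the $i$-th columns (singular vectors) of $U^*$ and $V^*$ with $i \in S_2$. It is obvious that for all $\Delta \in \mathbb{R}^{m_1 \times m_2}$, the following decomposition holds:

\[ \Pi_{\mathcal{F}}(\Delta) = \Pi_{\mathcal{F}_{S_1}}(\Delta) + \Pi_{\mathcal{F}_{S_2}}(\Delta). \]

In addition, since $U^*, V^*$ are unitary matrices, we have

\[ \mathcal{F}_{S_1} \subseteq \mathcal{F}^{\perp}_{S_2} \quad \text{and} \quad \mathcal{F}_{S_2} \subseteq \mathcal{F}^{\perp}_{S_1}, \]

where $\mathcal{F}^{\perp}_{S_1}$ and $\mathcal{F}^{\perp}_{S_2}$ denote the complementary subspaces of $\mathcal{F}_{S_1}$ and $\mathcal{F}_{S_2}$, respectively. Similar to the analysis in (2) on $S_1$, we have

\[ \Pi_{\mathcal{F}_{S_2}}(\nabla \mathcal{Q}_\lambda(\Theta^*)) = U^*_{S_2} q'_\lambda(\Gamma^*_{S_2}) (V^*_{S_2})^\top, \]

where $q'_\lambda(\Gamma^*_{S_2})$ is a diagonal matrix with $(q'_\lambda(\Gamma^*_{S_2}))_{jj} = 0$ for $j \notin S_2$, and for all $i \in S_2$, $(q'_\lambda(\Gamma^*_{S_2}))_{ii} = q'_\lambda((\gamma^*)_i)$, which is bounded by $\lambda$ in magnitude since $(\gamma^*)_i < \nu$ and $q'_\lambda(\cdot)$ satisfies the regularity condition (iv). Therefore

\[ \|\Pi_{\mathcal{F}_{S_2}}(\nabla \mathcal{Q}_\lambda(\Theta^*))\|_2 = \max_{i \in S_2} |(q'_\lambda(\Gamma^*_{S_2}))_{ii}| \le \lambda. \qquad \text{(B.8)} \]

Meanwhile, we have

\[ \|\Pi_{\mathcal{F}_{S_2}}(\lambda W^*)\|_2 \le \|\Pi_{\mathcal{F}}(\lambda U^* V^{*\top})\|_2 = \lambda, \qquad \text{(B.9)} \]

where the first inequality is due to the fact that $Z^* \in \mathcal{F}^{\perp}$, and the last equality comes from the fact that $\|U^* V^{*\top}\|_2 = 1$. Therefore, we have

\[ \|\Pi_{\mathcal{F}_{S_2}}(\lambda W^*)\|_2 \le \lambda. \qquad \text{(B.10)} \]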

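In addition, we have the fact that $\|\Pi_{\mathcal{F}_{S_2}}(\nabla \mathcal{L}_n(\Theta^*))\|_2 \le \|\nabla \mathcal{L}_n(\Theta^*)\|_2 \le \lambda$, which indicates that

\[ \langle \Pi_{\mathcal{F}_{S_2}}(\nabla \widetilde{\mathcal{L}}_{n,\lambda}(\Theta^*) + \lambda W^*), \Theta^* - \hat\Theta \rangle = \langle \Pi_{\mathcal{F}_{S_2}}(\nabla \mathcal{L}_n(\Theta^*) + \nabla \mathcal{Q}_\lambda(\Theta^*) + \lambda W^*), \Theta^* - \hat\Theta \rangle \]
\[ = \langle \Pi_{\mathcal{F}_{S_2}}(\nabla \mathcal{L}_n(\Theta^*)), \Theta^* - \hat\Theta \rangle + \langle \Pi_{\mathcal{F}_{S_2}}(\nabla \mathcal{Q}_\lambda(\Theta^*)), \Theta^* - \hat\Theta \rangle + \langle \Pi_{\mathcal{F}_{S_2}}(\lambda W^*), \Theta^* - \hat\Theta \rangle \]
\[ \le \big[ \|\Pi_{\mathcal{F}_{S_2}}(\nabla \mathcal{L}_n(\Theta^*))\|_2 + \|\Pi_{\mathcal{F}_{S_2}}(\nabla \mathcal{Q}_\lambda(\Theta^*))\|_2 + \|\Pi_{\mathcal{F}_{S_2}}(\lambda W^*)\|_2 \big] \, \|\Pi_{\mathcal{F}_{S_2}}(\Theta^* - \hat\Theta)\|_*, \]

where the last inequality is due to Hölder's inequality. Since we have obtained the bound for each term, as in (B.8), (B.9), (B.10), we have

\[ \langle \Pi_{\mathcal{F}_{S_2}}(\nabla \widetilde{\mathcal{L}}_{n,\lambda}(\Theta^*) + \lambda W^*), \Theta^* - \hat\Theta \rangle \le 3\lambda \|\Pi_{\mathcal{F}_{S_2}}(\Theta^* - \hat\Theta)\|_* \]
\[ \le 3\lambda \sqrt{r_2}\, \|\Theta^* - \hat\Theta\|_F, \qquad \text{(B.11)} \]

where the last inequality utilizes the fact that $\mathrm{rank}(\Pi_{\mathcal{F}_{S_2}}(\Theta^* - \hat\Theta)) \le r_2$.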
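
The two norm inequalities used repeatedly in parts (2) and (3), namely Hölder's inequality $\langle G, B \rangle \le \|G\|_2 \|B\|_*$ and the rank bound $\|B\|_* \le \sqrt{\mathrm{rank}(B)}\, \|B\|_F$, can be sanity-checked on random matrices. The sketch below assumes NumPy; the matrix sizes, the rank `r2`, and the seed are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
r2 = 3  # illustrative rank (assumption)

G = rng.standard_normal((8, 6))                                   # plays the role of a gradient
B = rng.standard_normal((8, r2)) @ rng.standard_normal((r2, 6))   # a matrix of rank at most r2

spectral_G = np.linalg.norm(G, 2)      # largest singular value
nuclear_B = np.linalg.norm(B, "nuc")   # sum of singular values
frob_B = np.linalg.norm(B, "fro")

inner = np.sum(G * B)                  # trace inner product <G, B>

holder_gap = spectral_G * nuclear_B - inner   # nonnegative by Hoelder's inequality
rank_gap = np.sqrt(r2) * frob_B - nuclear_B   # nonnegative since rank(B) <= r2

print(holder_gap, rank_gap)
```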
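Adding (B.6), (B.7), and (B.11), we have

\[ (\kappa(\mathfrak{X}) - \zeta_-) \|\hat\Theta - \Theta^*\|_F^2 \le \langle \nabla \widetilde{\mathcal{L}}_{n,\lambda}(\Theta^*) + \lambda W^*, \Theta^* - \hat\Theta \rangle \]
\[ \le \sqrt{r_1}\, \|\Pi_{\mathcal{F}_{S_1}}(\nabla \mathcal{L}_n(\Theta^*))\|_2 \cdot \|\Theta^* - \hat\Theta\|_F + 3\lambda \sqrt{r_2}\, \|\Theta^* - \hat\Theta\|_F, \]

which indicates that

\[ \|\hat\Theta - \Theta^*\|_F \le \frac{\sqrt{r_1}}{\kappa(\mathfrak{X}) - \zeta_-} \|\Pi_{\mathcal{F}_{S_1}}(\nabla \mathcal{L}_n(\Theta^*))\|_2 + \frac{3\lambda \sqrt{r_2}}{\kappa(\mathfrak{X}) - \zeta_-}. \]

This completes the proof.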


□ 
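
As a rough illustration of how the final bound splits between large and small singular values, the sketch below evaluates its right-hand side for hypothetical inputs. The function name `theorem_3_4_bound` and all numeric values are assumptions chosen for illustration; in the theory, the projected gradient norm is typically much smaller than $\lambda$.

```python
import math


def theorem_3_4_bound(r1, r2, lam, kappa, zeta_minus, grad_norm_s1):
    """Right-hand side of the Frobenius-norm error bound derived above.

    r1, r2        -- numbers of singular values that are >= nu and < nu
    lam           -- regularization parameter lambda
    kappa         -- restricted strong convexity constant
    zeta_minus    -- curvature parameter of the nonconvex penalty
    grad_norm_s1  -- spectral norm of the loss gradient projected onto F_{S_1}
    """
    if kappa <= zeta_minus:
        raise ValueError("the bound requires kappa > zeta_minus")
    c = kappa - zeta_minus
    return math.sqrt(r1) * grad_norm_s1 / c + 3.0 * lam * math.sqrt(r2) / c


# Hypothetical numbers: rank 4, with all singular values large versus all small.
bound_all_large = theorem_3_4_bound(4, 0, lam=0.2, kappa=1.0, zeta_minus=0.5, grad_norm_s1=0.05)
bound_all_small = theorem_3_4_bound(0, 4, lam=0.2, kappa=1.0, zeta_minus=0.5, grad_norm_s1=0.05)
print(bound_all_large, bound_all_small)
```

When every nonzero singular value exceeds $\nu$ (so $r_2 = 0$), the $\lambda$-dependent term vanishes, which is the source of the faster rate.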
B.2 Proof of Theorem 3.5

Before presenting the proof of Theorem 3.5, we need the following lemma. 


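Lemma B.3 (Deterministic Bound). Suppose $\Theta^* \in \mathbb{R}^{m_1 \times m_2}$ has rank $r$, and $\mathfrak{X}(\cdot)$ satisfies RSC with respect to $\mathcal{C}$. Then the error bound between the oracle estimator $\hat\Theta_O$ and the true $\Theta^*$ satisfies

\[ \|\hat\Theta_O - \Theta^*\|_F \le \frac{2\sqrt{r}\, \|\Pi_{\mathcal{F}}(\nabla \mathcal{L}_n(\Theta^*))\|_2}{\kappa(\mathfrak{X})}. \qquad \text{(B.12)} \]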


Proof. Proof is provided in Section D.3. □ 

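Proof of Theorem 3.5. Suppose $\hat W \in \partial\|\hat\Theta\|_*$. Since $\hat\Theta$ is the solution to the SDP (2.2), the variational inequality yields

\[ \max_{\Theta'} \langle \hat\Theta - \Theta', \nabla \widetilde{\mathcal{L}}_{n,\lambda}(\hat\Theta) + \lambda \hat W \rangle \le 0. \qquad \text{(B.13)} \]

In the following, we will show that there exists some $\hat W_O \in \partial\|\hat\Theta_O\|_*$ such that, for all $\Theta' \in \mathbb{R}^{m_1 \times m_2}$,

\[ \max_{\Theta'} \langle \hat\Theta_O - \Theta', \nabla \widetilde{\mathcal{L}}_{n,\lambda}(\hat\Theta_O) + \lambda \hat W_O \rangle \le 0. \qquad \text{(B.14)} \]

Recall that $\widetilde{\mathcal{L}}_{n,\lambda}(\Theta) = \mathcal{L}_n(\Theta) + \mathcal{Q}_\lambda(\Theta)$. By projecting the components of the inner product on the LHS of (B.14) into the two complementary spaces $\mathcal{F}$ and $\mathcal{F}^{\perp}$, we have the following decomposition

\[ \langle \hat\Theta_O - \Theta', \nabla \widetilde{\mathcal{L}}_{n,\lambda}(\hat\Theta_O) + \lambda \hat W_O \rangle \]
\[ = \underbrace{\langle \Pi_{\mathcal{F}}(\hat\Theta_O - \Theta'), \nabla \widetilde{\mathcal{L}}_{n,\lambda}(\hat\Theta_O) + \lambda \hat W_O \rangle}_{I_1} + \underbrace{\langle \Pi_{\mathcal{F}^{\perp}}(\hat\Theta_O - \Theta'), \nabla \widetilde{\mathcal{L}}_{n,\lambda}(\hat\Theta_O) + \lambda \hat W_O \rangle}_{I_2}. \qquad \text{(B.15)} \]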


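Analysis of Term $I_1$. Let $\gamma^* = \gamma(\Theta^*)$ and $\hat\gamma_O = \gamma(\hat\Theta_O)$ be the vectors of (ordered) singular values of $\Theta^*$ and $\hat\Theta_O$, respectively. By the perturbation bound for singular values, i.e., Weyl's inequality (Weyl, 1912), we have that

\[ \max_i |(\gamma^*)_i - (\hat\gamma_O)_i| \le \|\Theta^* - \hat\Theta_O\|_2 \le \|\Theta^* - \hat\Theta_O\|_F. \]

Since Lemma B.3 provides the Frobenius norm bound on the estimation error $\Theta^* - \hat\Theta_O$, we obtain that

\[ \max_i |(\gamma^*)_i - (\hat\gamma_O)_i| \le \frac{2\sqrt{r}}{n \kappa(\mathfrak{X})} \|\mathfrak{X}^*(\epsilon)\|_2. \]

If it is assumed that $S = \mathrm{supp}(\gamma^*)$, we have $|S| = r$. The triangle inequality yields that

\[ \min_{i \in S} |(\hat\gamma_O)_i| = \min_{i \in S} |(\hat\gamma_O)_i - (\gamma^*)_i + (\gamma^*)_i| \ge -\max_{i \in S} |(\hat\gamma_O - \gamma^*)_i| + \min_{i \in S} |(\gamma^*)_i| \]
\[ \ge -\frac{2\sqrt{r}}{n \kappa(\mathfrak{X})} \|\mathfrak{X}^*(\epsilon)\|_2 + \min_{i \in S} |(\gamma^*)_i| \ge \nu, \]

where the last inequality on the second line is derived based on the condition that $\min_{i \in S} |(\gamma^*)_i| \ge \nu + 2 n^{-1} \sqrt{r}\, \|\mathfrak{X}^*(\epsilon)\|_2 / \kappa(\mathfrak{X})$. Based on the definition of the oracle estimator (3.2), $\hat\Theta_O \in \mathcal{F}$, which implies $\mathrm{rank}(\hat\Theta_O) = r$. Therefore, we have

\[ (\hat\gamma_O)_1 \ge (\hat\gamma_O)_2 \ge \cdots \ge (\hat\gamma_O)_r \ge \nu > 0 = (\hat\gamma_O)_{r+1} = \cdots = (\hat\gamma_O)_m. \qquad \text{(B.16)} \]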
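
The singular value perturbation chain used above, $\max_i |(\gamma^*)_i - (\hat\gamma_O)_i| \le \|\Theta^* - \hat\Theta_O\|_2 \le \|\Theta^* - \hat\Theta_O\|_F$, can be illustrated numerically. In this minimal sketch a small random perturbation of a low-rank matrix stands in for the oracle estimator (an assumption for illustration only; the actual oracle estimator lies in $\mathcal{F}$); sizes and seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)
m1, m2, r = 7, 5, 2  # illustrative sizes (assumption)

theta_star = rng.standard_normal((m1, r)) @ rng.standard_normal((r, m2))
perturbation = 0.01 * rng.standard_normal((m1, m2))
theta_o = theta_star + perturbation  # stand-in for the oracle estimator

# np.linalg.svd returns singular values in nonincreasing order.
gamma_star = np.linalg.svd(theta_star, compute_uv=False)
gamma_o = np.linalg.svd(theta_o, compute_uv=False)

max_shift = np.max(np.abs(gamma_star - gamma_o))
spec = np.linalg.norm(perturbation, 2)
frob = np.linalg.norm(perturbation, "fro")

print(max_shift, spec, frob)
```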
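By the definition of the oracle estimator, we have $\hat\Theta_O = U^* \hat\Gamma_O V^{*\top}$, where $\hat\Gamma_O \in \mathbb{R}^{r \times r}$ is the diagonal matrix with $\mathrm{diag}(\hat\Gamma_O) = \hat\gamma_O$. Since $\mathcal{P}_\lambda(\Theta) = \mathcal{Q}_\lambda(\Theta) + \lambda\|\Theta\|_*$, we have

\[ \Pi_{\mathcal{F}}(\nabla \mathcal{P}_\lambda(\hat\Theta_O)) = \Pi_{\mathcal{F}}(\nabla \mathcal{Q}_\lambda(\hat\Theta_O) + \lambda\, \partial\|\hat\Theta_O\|_*) \]
\[ = \Pi_{\mathcal{F}}(U^* q'_\lambda(\hat\Gamma_O) V^{*\top} + \lambda U^* V^{*\top} + \lambda \hat Z_O) \qquad \text{(B.17)} \]
\[ = U^* \big(q'_\lambda((\hat\Gamma_O)_S) + \lambda I_r\big) V^{*\top}, \]

where $\hat Z_O \in \mathcal{F}^{\perp}$, $\|\hat Z_O\|_2 \le 1$, and $(\hat\Gamma_O)_S = \hat\Gamma_O \in \mathbb{R}^{r \times r}$ is a diagonal matrix with $\mathrm{diag}((\hat\Gamma_O)_S) = (\hat\gamma_O)_S$. The first equality in (B.17) is based on the definition of $\nabla \mathcal{Q}_\lambda(\cdot)$ and $\partial\|\cdot\|_*$, while the second simply projects each component into the subspace $\mathcal{F}$. Since $p_\lambda(t) = q_\lambda(t) + \lambda|t|$, we have $p'_\lambda(t) = q'_\lambda(t) + \lambda$ for all $t > 0$.

Consider the diagonal matrix $q'_\lambda((\hat\Gamma_O)_S) + \lambda I_r$. Its $i$-th diagonal element ($i \in S$) is

\[ \big(q'_\lambda((\hat\Gamma_O)_S) + \lambda I_r\big)_{ii} = q'_\lambda((\hat\gamma_O)_i) + \lambda = p'_\lambda((\hat\gamma_O)_i). \]

Since $p_\lambda(\cdot)$ satisfies the regularity condition (ii), that $p'_\lambda(t) = 0$ for all $t \ge \nu$, we have $p'_\lambda((\hat\gamma_O)_i) = 0$ for $i \in S$, in light of the fact that $(\hat\gamma_O)_i \ge \nu > 0$. Therefore, the diagonal matrix $q'_\lambda((\hat\Gamma_O)_S) + \lambda I_r = 0$, substituting which into (B.17) yields

\[ \Pi_{\mathcal{F}}(\nabla \mathcal{P}_\lambda(\hat\Theta_O)) = 0. \qquad \text{(B.18)} \]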

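Since $\hat\Theta_O$ is a minimizer of (3.2) over $\mathcal{F}$, we have the following optimality condition: for all $\Theta' \in \mathbb{R}^{m_1 \times m_2}$,

\[ \max_{\Theta'} \langle \Pi_{\mathcal{F}}(\hat\Theta_O - \Theta'), \nabla \mathcal{L}_n(\hat\Theta_O) \rangle \le 0. \qquad \text{(B.19)} \]

Substituting (B.18) and (B.19) into term $I_1$, we have for all $\hat W_O \in \partial\|\hat\Theta_O\|_*$,

\[ \max_{\Theta'} \langle \Pi_{\mathcal{F}}(\hat\Theta_O - \Theta'), \nabla \widetilde{\mathcal{L}}_{n,\lambda}(\hat\Theta_O) + \lambda \hat W_O \rangle \]
\[ = \max_{\Theta'} \langle \Pi_{\mathcal{F}}(\hat\Theta_O - \Theta'), \nabla \mathcal{L}_n(\hat\Theta_O) \rangle + \max_{\Theta'} \langle \Pi_{\mathcal{F}}(\hat\Theta_O - \Theta'), \Pi_{\mathcal{F}}(\nabla \mathcal{P}_\lambda(\hat\Theta_O)) \rangle \qquad \text{(B.20)} \]
\[ \le 0. \]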

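Analysis of Term $I_2$. By the definition of $\nabla \mathcal{Q}_\lambda(\Theta)$, and the condition that $q'_\lambda(\cdot)$ satisfies the regularity condition (iii) in Assumption 3.3, we have the SVD of $\nabla \mathcal{Q}_\lambda(\hat\Theta_O)$ as $\nabla \mathcal{Q}_\lambda(\hat\Theta_O) = U^* q'_\lambda(\hat\Gamma_O) V^{*\top}$, where $\hat\Gamma_O \in \mathbb{R}^{r \times r}$ is a diagonal matrix. Projecting $\nabla \mathcal{Q}_\lambda(\hat\Theta_O)$ into $\mathcal{F}^{\perp}$ yields that

\[ \Pi_{\mathcal{F}^{\perp}}(\nabla \mathcal{Q}_\lambda(\hat\Theta_O)) = (I_{m_1} - U^* U^{*\top}) U^* q'_\lambda(\hat\Gamma_O) V^{*\top} (I_{m_2} - V^* V^{*\top}) \]
\[ = (U^* - U^*) q'_\lambda(\hat\Gamma_O) (V^{*\top} - V^{*\top}) \]
\[ = 0. \]

Therefore,

\[ I_2 = \langle \Pi_{\mathcal{F}^{\perp}}(-\Theta'), \nabla \mathcal{L}_n(\hat\Theta_O) + \lambda \hat W_O \rangle. \]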

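Moreover, the triangle inequality yields

\[ \|\nabla \mathcal{L}_n(\hat\Theta_O)\|_2 \le \|\nabla \mathcal{L}_n(\Theta^*)\|_2 + \|\nabla \mathcal{L}_n(\Theta^*) - \nabla \mathcal{L}_n(\hat\Theta_O)\|_2 \]
\[ \le \|\nabla \mathcal{L}_n(\Theta^*)\|_2 + \|\nabla \mathcal{L}_n(\Theta^*) - \nabla \mathcal{L}_n(\hat\Theta_O)\|_F \]
\[ \le \|\nabla \mathcal{L}_n(\Theta^*)\|_2 + \rho(\mathfrak{X}) \|\Theta^* - \hat\Theta_O\|_F, \qquad \text{(B.21)} \]

where the second inequality comes from the fact that $\|\nabla \mathcal{L}_n(\Theta^*) - \nabla \mathcal{L}_n(\hat\Theta_O)\|_2 \le \|\nabla \mathcal{L}_n(\Theta^*) - \nabla \mathcal{L}_n(\hat\Theta_O)\|_F$, while the last inequality is obtained by the restricted strong smoothness (Assumption 3.2), which is equivalent to

\[ \|\nabla \mathcal{L}_n(\Theta) - \nabla \mathcal{L}_n(\Theta + \hat\Delta_O)\|_F \le \rho(\mathfrak{X}) \|\hat\Delta_O\|_F \]

over the restricted set $\mathcal{C}$; since $\Pi_{\mathcal{F}^{\perp}}(\hat\Delta_O) = 0$, it is evident that $\hat\Delta_O \in \mathcal{C}$.
Substituting (B.12) of Lemma B.3 into (B.21), we have

\[ \|\Pi_{\mathcal{F}^{\perp}}(\nabla \mathcal{L}_n(\hat\Theta_O))\|_2 \le \|\nabla \mathcal{L}_n(\hat\Theta_O)\|_2 \le \|\nabla \mathcal{L}_n(\Theta^*)\|_2 + \frac{2\sqrt{r}\, \rho(\mathfrak{X})}{n \kappa(\mathfrak{X})} \|\Pi_{\mathcal{F}}(\mathfrak{X}^*(\epsilon))\|_2 \le \lambda, \]

where the last inequality follows from the choice of $\lambda$.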

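By setting $\hat Z_O = -\lambda^{-1} \Pi_{\mathcal{F}^{\perp}}(\nabla \mathcal{L}_n(\hat\Theta_O))$, such that $\hat W_O = U^* V^{*\top} + \hat Z_O \in \partial\|\hat\Theta_O\|_*$ since $\hat Z_O$ satisfies the conditions $\hat Z_O \in \mathcal{F}^{\perp}$ and $\|\hat Z_O\|_2 \le 1$, we have

\[ \Pi_{\mathcal{F}^{\perp}}(\nabla \mathcal{L}_n(\hat\Theta_O) + \lambda \hat W_O) = 0, \]

which implies that

\[ I_2 = \langle \Pi_{\mathcal{F}^{\perp}}(-\Theta'), 0 \rangle = 0. \qquad \text{(B.22)} \]

Substituting (B.20) and (B.22) into (B.15), we obtain (B.14), namely

\[ \max_{\Theta'} \langle \hat\Theta_O - \Theta', \nabla \widetilde{\mathcal{L}}_{n,\lambda}(\hat\Theta_O) + \lambda \hat W_O \rangle \le 0. \]

Now we are going to prove that $\hat\Theta_O = \hat\Theta$.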

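Applying Lemma B.1, we have

\[ \widetilde{\mathcal{L}}_{n,\lambda}(\hat\Theta) \ge \widetilde{\mathcal{L}}_{n,\lambda}(\hat\Theta_O) + \langle \nabla \widetilde{\mathcal{L}}_{n,\lambda}(\hat\Theta_O), \hat\Theta - \hat\Theta_O \rangle + \frac{\kappa(\mathfrak{X}) - \zeta_-}{2} \|\hat\Theta - \hat\Theta_O\|_F^2, \qquad \text{(B.23)} \]

\[ \widetilde{\mathcal{L}}_{n,\lambda}(\hat\Theta_O) \ge \widetilde{\mathcal{L}}_{n,\lambda}(\hat\Theta) + \langle \nabla \widetilde{\mathcal{L}}_{n,\lambda}(\hat\Theta), \hat\Theta_O - \hat\Theta \rangle + \frac{\kappa(\mathfrak{X}) - \zeta_-}{2} \|\hat\Theta_O - \hat\Theta\|_F^2. \qquad \text{(B.24)} \]

On the other hand, because of the convexity of the nuclear norm $\|\cdot\|_*$, we obtain

\[ \lambda\|\hat\Theta\|_* \ge \lambda\|\hat\Theta_O\|_* + \lambda\langle \hat\Theta - \hat\Theta_O, \hat W_O \rangle, \qquad \text{(B.25)} \]

\[ \lambda\|\hat\Theta_O\|_* \ge \lambda\|\hat\Theta\|_* + \lambda\langle \hat\Theta_O - \hat\Theta, \hat W \rangle. \qquad \text{(B.26)} \]

Adding (B.23) to (B.26), we obtain

\[ 0 \ge \underbrace{\langle \nabla \widetilde{\mathcal{L}}_{n,\lambda}(\hat\Theta) + \lambda \hat W, \hat\Theta_O - \hat\Theta \rangle}_{I_3} + \underbrace{\langle \nabla \widetilde{\mathcal{L}}_{n,\lambda}(\hat\Theta_O) + \lambda \hat W_O, \hat\Theta - \hat\Theta_O \rangle}_{I_4} + (\kappa(\mathfrak{X}) - \zeta_-) \|\hat\Theta_O - \hat\Theta\|_F^2. \qquad \text{(B.27)} \]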

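Analysis of Term $I_3$. By (B.13), we have

\[ \langle \nabla \widetilde{\mathcal{L}}_{n,\lambda}(\hat\Theta) + \lambda \hat W, \hat\Theta - \hat\Theta_O \rangle \le \max_{\Theta'} \langle \nabla \widetilde{\mathcal{L}}_{n,\lambda}(\hat\Theta) + \lambda \hat W, \hat\Theta - \Theta' \rangle \le 0. \qquad \text{(B.28)} \]

Therefore $I_3 \ge 0$.

Analysis of Term $I_4$. By (B.14), we have

\[ \langle \nabla \widetilde{\mathcal{L}}_{n,\lambda}(\hat\Theta_O) + \lambda \hat W_O, \hat\Theta_O - \hat\Theta \rangle \le \max_{\Theta'} \langle \nabla \widetilde{\mathcal{L}}_{n,\lambda}(\hat\Theta_O) + \lambda \hat W_O, \hat\Theta_O - \Theta' \rangle \le 0. \qquad \text{(B.29)} \]

Therefore $I_4 \ge 0$. Substituting (B.28) and (B.29) into (B.27) yields that

\[ (\kappa(\mathfrak{X}) - \zeta_-) \|\hat\Theta_O - \hat\Theta\|_F^2 \le 0, \]

which holds if and only if $\hat\Theta_O = \hat\Theta$, because $\kappa(\mathfrak{X}) > \zeta_-$.
By Lemma B.3, we obtain the error bound

\[ \|\hat\Theta - \Theta^*\|_F = \|\hat\Theta_O - \Theta^*\|_F \le \frac{2\sqrt{r}\, \|\Pi_{\mathcal{F}}(\nabla \mathcal{L}_n(\Theta^*))\|_2}{\kappa(\mathfrak{X})}, \]

which completes the proof.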


□ 


C Proof of the Results for Specific Examples 

In this section, we provide the detailed proofs for corollaries of specific examples presented in Section 3.2. 
We will first establish the RSC condition for both examples, followed by proofs of the corollaries and more 
results on oracle property respecting two specific examples of matrix completion. 


C.l Matrix Completion 


As Candes and Recht (2012) show with various examples, the low-rank assumption alone is insufficient for recovering the matrix, since it is infeasible to recover overly “spiky” matrices, which have very few large entries. Some existing work (Candes and Recht, 2012) imposes stringent matrix incoherence conditions to preclude such matrices; these assumptions are relaxed in more recent work (Negahban and Wainwright, 2012; Gunasekar et al., 2014) by restricting the spikiness ratio, which is defined as follows:

\[ \alpha_{sp}(\Theta) = \frac{\sqrt{m_1 m_2}\, \|\Theta\|_\infty}{\|\Theta\|_F}. \]

Assumption C.1. There exists a known $\alpha^*$, such that

\[ \|\Theta^*\|_\infty = \frac{\alpha_{sp}(\Theta^*)\, \|\Theta^*\|_F}{\sqrt{m_1 m_2}} \le \alpha^*. \]
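
To give a feel for the spikiness ratio, the following minimal sketch computes $\alpha_{sp}$ for two extreme cases: a constant matrix, which attains the smallest possible value 1, and a matrix with a single nonzero entry, which attains the largest possible value $\sqrt{m_1 m_2}$. The function name `spikiness_ratio` and the $4 \times 5$ examples are illustrative assumptions.

```python
import numpy as np


def spikiness_ratio(theta):
    """alpha_sp(Theta) = sqrt(m1 * m2) * ||Theta||_inf / ||Theta||_F,
    where ||.||_inf is the largest absolute entry."""
    m1, m2 = theta.shape
    return np.sqrt(m1 * m2) * np.max(np.abs(theta)) / np.linalg.norm(theta, "fro")


flat = np.ones((4, 5))      # mass spread evenly over all entries
spiky = np.zeros((4, 5))
spiky[0, 0] = 1.0           # all mass on a single entry

print(spikiness_ratio(flat))   # least spiky case
print(spikiness_ratio(spiky))  # most spiky case
```

Under uniform sampling, a matrix like `spiky` is essentially unrecoverable unless its single nonzero entry happens to be observed, which is why Assumption C.1 bounds the largest entry.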

For the example of matrix completion, we have the following matrix concentration inequality, which follows from the proof of Corollary 1 in Negahban and Wainwright (2012).

Proposition C.2. Let $X_i$ be uniformly distributed on $\mathcal{X}$, and let $\{\xi_i\}_{i=1}^n$ be a finite sequence of independent Gaussian variables with variance $\sigma^2$. There exist constants $C_1, C_2$ such that with probability at least $1 - C_2/M$, we have

\[ \Big\| \frac{1}{n} \sum_{i=1}^n \xi_i X_i \Big\|_2 \le C_1 \sigma \sqrt{\frac{M \log M}{m_1 m_2 n}}. \]

Furthermore, the following lemma plays a key role in obtaining faster rates for the estimator with nonconvex penalties. In particular, it provides an upper bound on $\|\Pi_{\mathcal{F}}(\nabla \mathcal{L}_n(\Theta^*))\|_2$.

Lemma C.3. If $\xi_i$ is Gaussian noise with variance $\sigma^2$ and $\mathcal{S}$ is an $r$-dimensional subspace, then it holds with probability at least $1 - C_2/M$ that

\[ \Big\| \Pi_{\mathcal{S}}\Big( \frac{1}{n} \sum_{i=1}^n \xi_i X_i \Big) \Big\|_2 \le C_1 \sigma \sqrt{\frac{r \log M}{m_1 m_2 n}}, \]

where $C_1, C_2$ are universal constants.

where Ci,C 2 are universal constants. 
Proof. Proof is provided in Section D.4. □

In addition, we have the following lemma (Theorem 1 in Negahban and Wainwright (2012)), which plays a
central role in establishing the RSC condition.


Lemma C.4. There are universal constants, ki, ^ 2 , Ci,..., C^, such that as long as n > C 2 M\ogM, if the 
following condition is satisfied that 


||A||f ||A||f “ kiTi^JlogM + fc2v^?V0Togl\T’ 


(C.l) 


we have 


||X„(A)||2 

I|A||f 

\/n 

1 /TO 1 TO 2 


7 l|A||f 

8 yfmim 2 


, C'iasp(A) 

^Jn 


with probability at least \ — Cz C^M log M). 


Proof of Corollary 3.6. With regard to the example of matrix completion, we consider a partially observed setting, i.e., only the entries over the subset $\mathcal{X}$ are observed. A uniform sampling model is assumed:

\[ i \sim \mathrm{uniform}([m_1]), \quad j \sim \mathrm{uniform}([m_2]). \]

Recall that $\hat\Delta = \hat\Theta - \Theta^*$. In this proof, we consider two cases, depending on whether the condition in (C.1) holds or not.


1. The condition in (C.l) does not hold. 

2. The condition in (C.l) does hold. 

Case 1. If the condition in (C.l) is violated, it implies that 


|A||^ < ymim2|| A|| 


• IIAII 


kiri\/\ogM + k2y/rfMTogW 


< ymim2(2a*)(|| A'll* 

< 12a* ^/rrnlrrl 2 \\^'\\F 


kiri^/log M + k2'/r2M'^ogM 


kiri^JlogM + k2y/r2MlogM 
y/rn 


where A' = njF'(A) and A" = Iljr^(A), the second inequality follows from || A||oo < ||®||oo + |j0*||oo < 2a*, 
and the decomposibility of nuclear norm that ||A||* = ||A'||» + ||A"||*; while the third inequality is based on 
the cone condition ||A'||* < 5||A"||* and |1A'||* < A/r||A'ljj^. 

Moreover, since HA'Ijp’ < |1 A||f, we obtain that 

(c, 

Case 2. The condition in (C.l) is satisfied. 

If C' 2 Q!sp(A)/i/ri > 1/2, we have 

liAII;^ < < 4^20%/^^^. (C.3) 

yjn V ^ 
If $C_2\, \alpha_{sp}(\hat\Delta)/\sqrt{n} \le 1/2$, by Lemma C.4, we have


|^n(A)| 




n 477117712 

In order to establish the RSC condition, we need to show that (C.4) is equivalent to Assumption 3.1. 
£„(0* + A) - £„(©*) - (V/:„(©*), A) 

= ^ E ((®* + X*) + ((®*’ X*) - y^f - ^ E ((®*’ x*) - y^) 


(C.4) 


2 n 

2=1 

||^n(E 

n 


Thus, we have that (C.4) establishes the RSC condition, and k(X) = C\l( 2 ra\m- 2 ). 

After establishing the RSC condition, what remains is to upper bound ||X* (e )||2 and 77 “^ (X*(e)) 

By Proposition C.2, we have that with high probability, 


-||X*(e)|| <^6(7 

71 " 

By Lemma C.3, we have that with high probability, 


M log M 

771i77l2?7 ’ 


(C.5) 


\n^,^{T{e))l<Cra^- 


(C. 6 ) 


/ ri log M 

7711771277 

Substituting (C.5) and (C. 6 ) into Theorem 3.4, we have that there exist positive constants C[,C 2 such that 


1 


|©-®1|f < C[ari 


logM 


Y^ril 17712 

Putting pieces (C.2), (C.3), and (C.7) together, we have 


+ C'a 


r 2 M log M 


(C.7) 


1 


v/to 17712 

which completes the proof. 


© — ©*||f < max{a*, tj} 


Cari 


logM 


■C 4 


72 M log M 


□ 


Corollary C.5. Under the conditions of Theorem 3.5, suppose $X_i$ is uniformly distributed on $\mathcal{X}$. There exist positive constants $C_1, \ldots, C_4$ such that, for any $t > 0$, if $\kappa(\mathfrak{X}) = C_1/(m_1 m_2) > \zeta_-$ and $\gamma^*$ satisfies


min ( 7 *)i > v + C 2 <J\/rmim 2 
iGS 


MlogM 


where S = supp((T*), for estimator in (2.2) with regularization parameter 


^ ^ £' 3(1 + y/r)(7^ 


MlogM 

7177117772 ’ 


we have that with high probability, © = ©o, which yields that rank(©) = rank(©o) = rank(©*) = r. In 
addition, we have 


I 


y/mim2 


II© 


®*||f < C 4 ra 
Proof of Corollary C.5. As shown in the proof of Corollary 3.6, we have $\kappa(\mathfrak{X}) = C_1/(m_1 m_2)$, together with (C.5) and (C.6). To prove Corollary C.5 via Theorem 3.5, what remains is to obtain $\rho(\mathfrak{X})$ in Assumption 3.2. It can be shown that Assumption 3.2 is equivalent to

^|lA||^>i|lX(A)||i 

2 n 

As implied by Lemma C.4, when n > Cfa* > C'|asp(A), we have that with high probability, the following 
holds: 

^|lA|||.>i||X(A)|l2. 

77117712 77 

Thus, p(X) = C^I{m\m- 2 ), which completes the proof. □ 


C.2 Matrix Sensing With Dependent Sampling 


In this subsection, we provide the proof for the results on matrix sensing. In particular, we will first establish 
the RSC condition for the application of matrix sensing, followed by the proof on faster convergence rate and 
more results on the oracle property. 

In order to establish the RSC condition, we need the following lemma (Proposition 1 in Negahban and 
Wainwright (2011)). 

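Lemma C.6. Consider the sampling operator of the $\Sigma$-ensemble. It holds with probability at least $1 - 2\exp(-n/32)$ that, for all $\Delta \in \mathbb{R}^{m_1 \times m_2}$,

\[ \frac{\|\mathfrak{X}(\Delta)\|_2}{\sqrt{n}} \ge \frac{1}{4} \|\sqrt{\Sigma}\, \mathrm{vec}(\Delta)\|_2 - 12\, \pi(\Sigma) \Big( \sqrt{\frac{m_1}{n}} + \sqrt{\frac{m_2}{n}} \Big) \|\Delta\|_*. \]

In addition, we need the upper bound of $n^{-1}\|\mathfrak{X}^*(\epsilon)\|_2$, as stated in the following proposition (Lemma 6
in Negahban and Wainwright (2011)).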


Proposition C.7. With high probability, there are universal constants Ci,C 2 and C 3 such that 


|X*(e)| 




< C 2 exp ( - ( 73(7771 + m 2 )), 


where 7r(S)2 = sup|[u||,=i,||v|| 2 =i Var(u^Xv). 

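Proof of Corollary 3.8. To begin with, we need to establish the RSC condition as in Assumption 3.1. According to Lemma C.6, we have that

\[ \frac{\|\mathfrak{X}(\hat\Delta)\|_2}{\sqrt{n}} \ge \frac{\sqrt{\lambda_{\min}(\Sigma)}}{4} \|\hat\Delta\|_F - 12\, \pi(\Sigma) \Big( \sqrt{\frac{m_1}{n}} + \sqrt{\frac{m_2}{n}} \Big) \|\hat\Delta\|_*. \]

By the decomposability of the nuclear norm, we have that

\[ \|\hat\Delta\|_* = \|\hat\Delta'\|_* + \|\hat\Delta''\|_* \le 6 \|\hat\Delta'\|_* \le 6\sqrt{r}\, \|\hat\Delta'\|_F \le 6\sqrt{r}\, \|\hat\Delta\|_F, \qquad \text{(C.8)} \]

where $\hat\Delta' = \Pi_{\mathcal{F}}(\hat\Delta)$ and $\hat\Delta'' = \Pi_{\mathcal{F}^{\perp}}(\hat\Delta)$.

By substituting (C.8) into Lemma C.6, we have that

\[ \frac{\|\mathfrak{X}(\hat\Delta)\|_2}{\sqrt{n}} \ge \Big\{ \frac{\sqrt{\lambda_{\min}(\Sigma)}}{4} - 72 \sqrt{r}\, \pi(\Sigma) \Big( \sqrt{\frac{m_1}{n}} + \sqrt{\frac{m_2}{n}} \Big) \Big\} \|\hat\Delta\|_F. \]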
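Thus, for $n > C_1 r\, \pi^2(\Sigma)\, m_1 m_2 / \lambda_{\min}(\Sigma)$, where $C_1$ is sufficiently large such that

\[ 72 \sqrt{r}\, \pi(\Sigma) \Big( \sqrt{\frac{m_1}{n}} + \sqrt{\frac{m_2}{n}} \Big) \le \frac{\sqrt{\lambda_{\min}(\Sigma)}}{8}, \]

we have

\[ \frac{\|\mathfrak{X}(\hat\Delta)\|_2}{\sqrt{n}} \ge \frac{\sqrt{\lambda_{\min}(\Sigma)}}{8} \|\hat\Delta\|_F, \]

which implies that

\[ \frac{\|\mathfrak{X}(\hat\Delta)\|_2^2}{n} \ge \frac{\lambda_{\min}(\Sigma)}{64} \|\hat\Delta\|_F^2. \]

Therefore, $\kappa(\mathfrak{X}) = \lambda_{\min}(\Sigma)/32$ such that the following holds:

\[ \frac{\|\mathfrak{X}(\hat\Delta)\|_2^2}{n} \ge \frac{\kappa(\mathfrak{X})}{2} \|\hat\Delta\|_F^2, \]

which establishes the RSC condition for matrix sensing.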
On the other hand, we have 




|U^^UJ|V£„(©*)VJ^V 


*T| 
Si I 


= ||UJ7V£, 




V 


where the second inequality follows from the property of left and right singular vectors 
It is worth noting that V£„(0*)Vg^ S By Proposition C.7, we have that 


||U*TV£„(0*)V*||2 < 2CoaTTii:)^, 

UJ|V£„(0*)VSJ|2 < 2CoaF(S)y^, 


(C.9) 


which hold with probability at least $1 - C_1 \exp(-C_2 r_1)$.

The upper bound is obtained directly from Theorem 3.4 and (C.9). Thus, we complete the proof. □

Corollary C.8. Under the condition of Theorem 3.5, for some universal constants Ci ,... ,Ce if k(X) = 
C'iAniin(S) > (- and 7 * satisfies 


min|( 7 *)i| > i/+ C'2cr7r(S) 


ieS 


A/nAmin(S) 


where S = supp( 7 *), for estimator in (2.2) with regularization parameter 


A>C3 


1 + 


Aniax(S) 

Amin(5]) 


anil:) + 


we have that 0 = 0 o, which yields that rank( 0 ) = rank( 0 ( 9 ) = rank( 0 *) = r, with probability at least 
1 — (£4 exp(—(£ 5(7711 + 7772 )). In addition, we have 


II© 


0* 


F < 


C6r-7r(S) 
V^Ami„(S) ■ 
Proof of Corollary C.8. The proof follows from the proof of Corollary 3.8 and Theorem 3.5. As shown in the proof of Corollary 3.8, we have $\kappa(\mathfrak{X}) = C_1 \lambda_{\min}(\Sigma)$, together with (C.9). To prove Corollary C.8 via Theorem 3.5, what remains is to obtain $\rho(\mathfrak{X})$ in Assumption 3.2 for the example of matrix sensing.

According to Assumption 3.2, we have that $\rho(\mathfrak{X}) = \lambda_{\max}(H_n)$, where $H_n$ is the Hessian matrix of $\mathcal{L}_n(\cdot)$. Based on the definition of $\mathcal{L}_n$, we have

\[ H_n = n^{-1} \sum_{i=1}^n \mathrm{vec}(X_i)\, \mathrm{vec}(X_i)^\top. \]

Thus $\mathbb{E}[H_n] = \Sigma$. By concentration, when $n$ is sufficiently large, with high probability $\lambda_{\max}(H_n) \le 2\lambda_{\max}(\Sigma)$, which is equivalent to $\rho(\mathfrak{X}) \le 2\lambda_{\max}(\Sigma)$ holding with high probability. This completes the proof. □
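
The concentration claim $\lambda_{\max}(H_n) \le 2\lambda_{\max}(\Sigma)$ can be illustrated with a small simulation. This is a minimal sketch assuming Gaussian sensing matrices with $\mathrm{vec}(X_i) \sim N(0, \Sigma)$; the particular covariance `sigma`, the sizes, and the seed are hypothetical choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
m1, m2, n = 3, 3, 5000  # illustrative sizes (assumption)
d = m1 * m2

# A hypothetical covariance Sigma for vec(X_i): identity plus a PSD perturbation.
A = rng.standard_normal((d, d))
sigma = np.eye(d) + 0.1 * (A @ A.T) / d

# Draw n sensing matrices in vectorized form, each row ~ N(0, Sigma).
chol = np.linalg.cholesky(sigma)
vecs = rng.standard_normal((n, d)) @ chol.T

# Empirical Hessian H_n = n^{-1} * sum_i vec(X_i) vec(X_i)^T.
H_n = vecs.T @ vecs / n

# eigvalsh returns eigenvalues in ascending order, so the last is the largest.
lam_max_H = np.linalg.eigvalsh(H_n)[-1]
lam_max_sigma = np.linalg.eigvalsh(sigma)[-1]

print(lam_max_H, 2 * lam_max_sigma)
```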

D Proof of Auxiliary Lemmas 

D.l Proof of Lemma B.l 

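Proof. By the restricted strong convexity assumption (Assumption 3.1), we have

\[ \mathcal{L}_n(\Theta_2) \ge \mathcal{L}_n(\Theta_1) + \langle \nabla \mathcal{L}_n(\Theta_1), \Theta_2 - \Theta_1 \rangle + \frac{\kappa(\mathfrak{X})}{2} \|\Theta_2 - \Theta_1\|_F^2. \qquad \text{(D.1)} \]

In the following, we will show the strong smoothness of $-\mathcal{Q}_\lambda(\cdot)$, based on the regularity condition (ii), which imposes a constraint on the level of nonconvexity of $q_\lambda(\cdot)$. Assume $\gamma_1 = \gamma(\Theta_1)$, $\gamma_2 = \gamma(\Theta_2)$ are the vectors of singular values of $\Theta_1, \Theta_2$, respectively, and the singular values in $\gamma_1, \gamma_2$ are nonincreasing. For $\Theta_1, \Theta_2$, we have the following singular value decompositions:

\[ \Theta_1 = U_1 \Gamma_1 V_1^\top, \qquad \Theta_2 = U_2 \Gamma_2 V_2^\top, \]

where $\Gamma_1, \Gamma_2 \in \mathbb{R}^{m \times m}$ are diagonal matrices with $\Gamma_1 = \mathrm{diag}(\gamma_1)$, $\Gamma_2 = \mathrm{diag}(\gamma_2)$. For each pair of singular values of $\Theta_1, \Theta_2$, namely $((\gamma_1)_i, (\gamma_2)_i)$ where $i = 1, 2, \ldots, m$, we have

\[ -\zeta_- \big((\gamma_1)_i - (\gamma_2)_i\big)^2 \le \big[q'_\lambda((\gamma_1)_i) - q'_\lambda((\gamma_2)_i)\big] \big((\gamma_1)_i - (\gamma_2)_i\big), \]

which is equivalent to

\[ \big\langle (-q'_\lambda(\Gamma_1)) - (-q'_\lambda(\Gamma_2)), \Gamma_1 - \Gamma_2 \big\rangle \le \zeta_- \|\Gamma_1 - \Gamma_2\|_F^2, \]

which yields

\[ \big\langle (-\nabla \mathcal{Q}_\lambda(\Theta_1)) - (-\nabla \mathcal{Q}_\lambda(\Theta_2)), \Theta_1 - \Theta_2 \big\rangle \le \zeta_- \|\Theta_1 - \Theta_2\|_F^2. \qquad \text{(D.2)} \]

Since (D.2) is the definition of strong smoothness of $-\mathcal{Q}_\lambda(\cdot)$, it can be shown to be equivalent to the following inequality:

\[ \mathcal{Q}_\lambda(\Theta_2) \ge \mathcal{Q}_\lambda(\Theta_1) + \langle \nabla \mathcal{Q}_\lambda(\Theta_1), \Theta_2 - \Theta_1 \rangle - \frac{\zeta_-}{2} \|\Theta_2 - \Theta_1\|_F^2. \qquad \text{(D.3)} \]

Adding up (D.1) and (D.3), we complete the proof. □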
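
The scalar curvature inequality at the heart of this proof, $-\zeta_-(t_1 - t_2)^2 \le [q'_\lambda(t_1) - q'_\lambda(t_2)](t_1 - t_2)$, can be checked on a grid for a concrete penalty. The sketch below again uses MCP as an assumed example, for which $q'_\lambda(t) = \max(\lambda - t/b, 0) - \lambda$ on $t \ge 0$ and the curvature parameter is $\zeta_- = 1/b$; the parameter values and grid are hypothetical.

```python
import numpy as np


def q_prime(t, lam, b):
    """q'_lambda(t) for t >= 0 under MCP: -t/b on [0, b*lam], then constant -lam."""
    return np.maximum(lam - t / b, 0.0) - lam


lam, b = 0.5, 3.0        # hypothetical tuning parameters
zeta_minus = 1.0 / b     # curvature parameter of MCP

# Evaluate the inequality on all pairs (t1, t2) from a grid via broadcasting.
grid = np.linspace(0.0, 4.0, 81)
t1 = grid[:, None]
t2 = grid[None, :]

lhs = -zeta_minus * (t1 - t2) ** 2
rhs = (q_prime(t1, lam, b) - q_prime(t2, lam, b)) * (t1 - t2)

holds = bool(np.all(lhs <= rhs + 1e-12))
print(holds)
```

The inequality holds because $q'_\lambda$ is nonincreasing with slope no steeper than $-1/b$, so the concavity of $q_\lambda$ is controlled by $\zeta_-$.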
D.2 Proof of Lemma B.2

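Proof. By Lemma B.1, we have that

\[ \widetilde{\mathcal{L}}_{n,\lambda}(\hat\Theta) + \lambda\|\hat\Theta\|_* - \widetilde{\mathcal{L}}_{n,\lambda}(\Theta^*) - \lambda\|\Theta^*\|_* \ge \langle \nabla \widetilde{\mathcal{L}}_{n,\lambda}(\Theta^*), \hat\Theta - \Theta^* \rangle + \lambda\|\hat\Theta\|_* - \lambda\|\Theta^*\|_*. \qquad \text{(D.4)} \]

For the first term on the RHS in (D.4), we have the following lower bound

\[ \langle \nabla \widetilde{\mathcal{L}}_{n,\lambda}(\Theta^*), \hat\Theta - \Theta^* \rangle = \langle \nabla \widetilde{\mathcal{L}}_{n,\lambda}(\Theta^*), \Pi_{\mathcal{F}}(\hat\Theta - \Theta^*) \rangle + \langle \nabla \widetilde{\mathcal{L}}_{n,\lambda}(\Theta^*), \Pi_{\mathcal{F}^{\perp}}(\hat\Theta - \Theta^*) \rangle \]
\[ \ge -\underbrace{\|\Pi_{\mathcal{F}}(\nabla \widetilde{\mathcal{L}}_{n,\lambda}(\Theta^*))\|_2}_{I_1} \cdot \|\Pi_{\mathcal{F}}(\hat\Theta - \Theta^*)\|_* - \underbrace{\|\Pi_{\mathcal{F}^{\perp}}(\nabla \widetilde{\mathcal{L}}_{n,\lambda}(\Theta^*))\|_2}_{I_2} \cdot \|\Pi_{\mathcal{F}^{\perp}}(\hat\Theta - \Theta^*)\|_*, \qquad \text{(D.5)} \]

where the last inequality follows from Hölder's inequality.

Analysis of term $I_1$. It can be shown that $\nabla \mathcal{L}_n(\Theta^*) = -\mathfrak{X}^*(\epsilon)/n$. Based on the condition that $\lambda \ge 2 n^{-1} \|\mathfrak{X}^*(\epsilon)\|_2$, we have that

\[ \|\nabla \mathcal{L}_n(\Theta^*)\|_2 \le \lambda/2. \qquad \text{(D.6)} \]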
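
The identity $\nabla \mathcal{L}_n(\Theta^*) = -\mathfrak{X}^*(\epsilon)/n$ can be verified by finite differences. The sketch below assumes the squared loss $\mathcal{L}_n(\Theta) = \frac{1}{2n} \sum_i (y_i - \langle X_i, \Theta \rangle)^2$ with matrix-completion measurements $X_i = e_{j(i)} e_{k(i)}^\top$, which is consistent with the identity but is stated here as an assumption; sizes, noise level, and seed are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
m1, m2, n = 5, 4, 30  # illustrative sizes (assumption)

theta_star = rng.standard_normal((m1, m2))
rows = rng.integers(0, m1, size=n)   # j(i): uniformly sampled row indices
cols = rng.integers(0, m2, size=n)   # k(i): uniformly sampled column indices
eps = 0.1 * rng.standard_normal(n)   # noise
y = theta_star[rows, cols] + eps     # y_i = <X_i, Theta*> + eps_i


def loss(theta):
    """Assumed squared loss L_n(Theta) = (1 / 2n) * sum_i (y_i - Theta[j(i), k(i)])^2."""
    return np.sum((y - theta[rows, cols]) ** 2) / (2 * n)


def adjoint(v):
    """Adjoint operator X^*(v) = sum_i v_i * e_{j(i)} e_{k(i)}^T."""
    out = np.zeros((m1, m2))
    np.add.at(out, (rows, cols), v)  # accumulates correctly when positions repeat
    return out


analytic = -adjoint(eps) / n

# Central finite differences of the loss at Theta*, one entry at a time.
h = 1e-6
numeric = np.zeros((m1, m2))
for a in range(m1):
    for c in range(m2):
        E = np.zeros((m1, m2))
        E[a, c] = h
        numeric[a, c] = (loss(theta_star + E) - loss(theta_star - E)) / (2 * h)

lam = 2 * np.linalg.norm(adjoint(eps), 2) / n  # smallest lambda allowed by the condition
print(np.max(np.abs(numeric - analytic)))
```

With `lam` chosen at the boundary of the condition, the spectral norm of the gradient equals exactly $\lambda/2$, matching (D.6).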

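Moreover, by condition (iv) in Assumption 3.3 and (D.6), we obtain that

\[ \|\Pi_{\mathcal{F}}(\nabla \widetilde{\mathcal{L}}_{n,\lambda}(\Theta^*))\|_2 = \|\Pi_{\mathcal{F}}(\nabla \mathcal{L}_n(\Theta^*) + \nabla \mathcal{Q}_\lambda(\Theta^*))\|_2 \le 3\lambda/2. \]

Analysis of term $I_2$. Since $\Pi_{\mathcal{F}^{\perp}}(\Theta^*) = 0$, we have that

\[ \|\Pi_{\mathcal{F}^{\perp}}(\nabla \widetilde{\mathcal{L}}_{n,\lambda}(\Theta^*))\|_2 = \|\Pi_{\mathcal{F}^{\perp}}(\nabla \mathcal{L}_n(\Theta^*))\|_2 \le \lambda/2. \qquad \text{(D.7)} \]

Putting the bounds on $I_1$ and $I_2$ into (D.5), we obtain

\[ \langle \nabla \widetilde{\mathcal{L}}_{n,\lambda}(\Theta^*), \hat\Theta - \Theta^* \rangle \ge -\frac{3\lambda}{2} \|\Pi_{\mathcal{F}}(\hat\Theta - \Theta^*)\|_* - \frac{\lambda}{2} \|\Pi_{\mathcal{F}^{\perp}}(\hat\Theta - \Theta^*)\|_*. \qquad \text{(D.8)} \]

Meanwhile, we have the lower bound on $\lambda\|\hat\Theta\|_* - \lambda\|\Theta^*\|_*$ that

\[ \lambda\|\hat\Theta\|_* - \lambda\|\Theta^*\|_* = \lambda\|\Pi_{\mathcal{F}}(\hat\Theta)\|_* + \lambda\|\Pi_{\mathcal{F}^{\perp}}(\hat\Theta)\|_* - \lambda\|\Theta^*\|_* \]
\[ \ge -\lambda\|\Pi_{\mathcal{F}}(\hat\Theta - \Theta^*)\|_* + \lambda\|\Pi_{\mathcal{F}^{\perp}}(\hat\Theta - \Theta^*)\|_*. \qquad \text{(D.9)} \]

Adding (D.8) and (D.9) yields that

\[ \langle \nabla \widetilde{\mathcal{L}}_{n,\lambda}(\Theta^*), \hat\Theta - \Theta^* \rangle + \lambda\|\hat\Theta\|_* - \lambda\|\Theta^*\|_* \ge -\frac{5\lambda}{2} \|\Pi_{\mathcal{F}}(\hat\Theta - \Theta^*)\|_* + \frac{\lambda}{2} \|\Pi_{\mathcal{F}^{\perp}}(\hat\Theta - \Theta^*)\|_*. \qquad \text{(D.10)} \]

Due to the fact that $\hat\Theta$ is the global minimizer of (2.2), provided the condition that $\kappa(\mathfrak{X}) > \zeta_-$, we have

\[ \widetilde{\mathcal{L}}_{n,\lambda}(\hat\Theta) + \lambda\|\hat\Theta\|_* - \widetilde{\mathcal{L}}_{n,\lambda}(\Theta^*) - \lambda\|\Theta^*\|_* \le 0. \qquad \text{(D.11)} \]

Substituting (D.10) and (D.11) into (D.4), since $\lambda > 0$, we have that

\[ \|\Pi_{\mathcal{F}^{\perp}}(\hat\Theta - \Theta^*)\|_* \le 5 \|\Pi_{\mathcal{F}}(\hat\Theta - \Theta^*)\|_*, \]

which completes the proof.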


□ 
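D.3 Proof of Lemma B.3

Proof. Let $\hat\Delta_O = \hat\Theta_O - \Theta^*$. According to the observation model (2.1) and the definition of $\mathfrak{X}(\cdot)$, we have

\[ \mathcal{L}_n(\hat\Theta_O) - \mathcal{L}_n(\Theta^*) = \frac{1}{2n} \sum_{i=1}^n (y_i - \langle X_i, \hat\Theta_O \rangle)^2 - \frac{1}{2n} \sum_{i=1}^n (y_i - \langle X_i, \Theta^* \rangle)^2 \]
\[ = \frac{1}{2n} \|\mathfrak{X}(\hat\Delta_O)\|_2^2 - \frac{1}{n} \langle \mathfrak{X}^*(\epsilon), \hat\Delta_O \rangle, \]

where $\mathfrak{X}^*(\epsilon) = \sum_{i=1}^n \epsilon_i X_i$ is the adjoint of the operator $\mathfrak{X}$. Because the oracle estimator $\hat\Theta_O$ minimizes $\mathcal{L}_n(\cdot)$ over the subspace $\mathcal{F}$, while $\Theta^* \in \mathcal{F}$, we have $\mathcal{L}_n(\hat\Theta_O) - \mathcal{L}_n(\Theta^*) \le 0$, which yields

\[ \frac{1}{2n} \|\mathfrak{X}(\hat\Delta_O)\|_2^2 \le \frac{1}{n} \langle \mathfrak{X}^*(\epsilon), \hat\Delta_O \rangle. \qquad \text{(D.12)} \]

On the other hand, recall that by the RSC condition (Assumption 3.1), we have

\[ \mathcal{L}_n(\Theta + \Delta) \ge \mathcal{L}_n(\Theta) + \langle \nabla \mathcal{L}_n(\Theta), \Delta \rangle + \frac{\kappa(\mathfrak{X})}{2} \|\Delta\|_F^2, \]

which implies that

\[ \frac{1}{2n} \|\mathfrak{X}(\hat\Delta_O)\|_2^2 - \frac{1}{n} \langle \mathfrak{X}^*(\epsilon), \hat\Delta_O \rangle - \langle \nabla \mathcal{L}_n(\Theta^*), \hat\Delta_O \rangle = \frac{1}{2n} \|\mathfrak{X}(\hat\Delta_O)\|_2^2 \ge \frac{\kappa(\mathfrak{X})}{2} \|\hat\Delta_O\|_F^2. \qquad \text{(D.13)} \]

Substituting (D.13) into (D.12), we have

\[ \frac{\kappa(\mathfrak{X})}{2} \|\hat\Delta_O\|_F^2 \le \frac{1}{2n} \|\mathfrak{X}(\hat\Delta_O)\|_2^2 \le \frac{1}{n} \langle \mathfrak{X}^*(\epsilon), \hat\Delta_O \rangle. \]

Therefore,

\[ \|\hat\Delta_O\|_F^2 \le \frac{2 \langle \Pi_{\mathcal{F}}(\mathfrak{X}^*(\epsilon)), \hat\Delta_O \rangle}{n \kappa(\mathfrak{X})} \le \frac{2 \|\Pi_{\mathcal{F}}(\mathfrak{X}^*(\epsilon))\|_2 \cdot \|\hat\Delta_O\|_*}{n \kappa(\mathfrak{X})}, \]

where the last inequality is due to the Hölder inequality. Moreover, since the rank of $\hat\Delta_O$ is at most $r$, we have the fact that $\|\hat\Delta_O\|_* \le \sqrt{r}\, \|\hat\Delta_O\|_F$, which indicates that

\[ \|\hat\Delta_O\|_F^2 \le \frac{2\sqrt{r}\, \|\Pi_{\mathcal{F}}(\mathfrak{X}^*(\epsilon))\|_2 \cdot \|\hat\Delta_O\|_F}{n \kappa(\mathfrak{X})}. \]

Therefore, we have the following deterministic error bound

\[ \|\hat\Delta_O\|_F \le \frac{2\sqrt{r}\, \|\Pi_{\mathcal{F}}(\mathfrak{X}^*(\epsilon))\|_2}{n \kappa(\mathfrak{X})} = \frac{2\sqrt{r}\, \|\Pi_{\mathcal{F}}(\nabla \mathcal{L}_n(\Theta^*))\|_2}{\kappa(\mathfrak{X})}, \]

where the last equality results from the fact that $\nabla \mathcal{L}_n(\Theta^*) = -\mathfrak{X}^*(\epsilon)/n$.
Thus, we complete the proof.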


□ 


D.4 Proof of Lemma C.3 

In order to prove Lemma C.3, we need the Ahlswede-Winter matrix bound. To begin with, we introduce the definitions of $\|\cdot\|_{\psi_1}$ and $\|\cdot\|_{\psi_2}$, followed by some established results on them.
The sub-Gaussian norm of $X$, denoted by $\|X\|_{\psi_2}$, is defined as follows

\[ \|X\|_{\psi_2} = \sup_{p \ge 1} p^{-1/2} (\mathbb{E}|X|^p)^{1/p}. \]

It is known that if $\mathbb{E}[X] = 0$, then $\mathbb{E}[\exp(tX)] \le \exp(C t^2 \|X\|_{\psi_2}^2)$ for all $t \in \mathbb{R}$.
The sub-exponential norm of $X$, denoted by $\|X\|_{\psi_1}$, is defined as follows

\[ \|X\|_{\psi_1} = \sup_{p \ge 1} p^{-1} (\mathbb{E}|X|^p)^{1/p}. \]

By Vershynin (2010), we have the following lemma.

Lemma D.1. For $Z_1$ and $Z_2$ being two sub-Gaussian random variables, $Z_1 Z_2$ is a sub-exponential random variable with

\[ \|Z_1 Z_2\|_{\psi_1} \le C \max\{\|Z_1\|_{\psi_2}^2, \|Z_2\|_{\psi_2}^2\}, \]

where $C > 0$ is an absolute constant.


Theorem D.2 (Ahlswede-Winter Matrix Bound). (Negahban and Wainwright, 2012) Let $Z_1, \ldots, Z_n$ be random matrices of size $m_1 \times m_2$ such that $\|Z_i\|_{\psi_1} \le K$ for all $i$. Furthermore, let $\delta_i^2 = \max\{\|\mathbb{E}[Z_i^\top Z_i]\|_2, \|\mathbb{E}[Z_i Z_i^\top]\|_2\}$ and $\delta^2 = \sum_{i=1}^n \delta_i^2$. Then we have


Ez, 

2=1 


> t < 17117712 max 


I exp 




Now we are ready to prove Lemma C.3. 

Proof of Lemma C.3. Since U* and V* are singular vectors, for S = J^(U*, V*), we have 


1 

n 


2=1 






2=1 


u 




n 

(!:«■ 


X, V 


Recall that = ej(j)eJ(.). Let Yi = e^X* = eiej(^i)e].^^y We have ||Yi||,/,j < Let Z* = U*^YiV* e 
]^rxr^ have 


||Z,|U, = ||U*^Y,V*||^^. 

Based on the definition of Y^, we have that ||Zi||,^j < Ca. By applying Theorem D.l, we have 

< C'a^. 

Thus, K = Ca'^. 

Furthermore, we have 


E[Z,Z,' ] = E[U* ' Y,V*V 


r*T vT 






Y,' U*] = E[e2u*^e,(,)e^(,)V*Y 

T 


■*T T 


•pi) 


U* 




u * 
Based on the definition of spectral norm, we have


||U* ' = max a' u* ' V*V* ' e,( 

l|a||2 = l 

= max bTe,.(,)eT V*V*^e,(,)eJ( 
||b||2 = l 

where the second equality follows by setting b = U*a G In addition, we have 

b^e,(,)e^(,)V*V*^efc(,)ey)b = b,(,)V*vy b,(,) = 
where is the fc-th row of V*. Thus 




7711 '^2 

J2 J2 U*^e,eT V* V*^efceJU* 


^ IIVl IIO-^ 

yy\J*^e,eJ 

" "i '"-j, 

EE 


- 7711 7712 

inax a^ y] y] VW^e^eJU*a 

i=i k=2 


17111712 ||a||2 - 

J 

mi m2 


1 


mi ‘m2 

7 = 1 fc = 2 


mim2 l|b||2=i^^^2 


Since ^ Ikfclli = l|V*||| = r, we obtain that 

\ ^ j\ ) iij mim2 


Therefore, we have 


||E[W^e,d)eT^)VW*^efc(,)ey. 


yTi 


2 

a^r 


and the same result also applies to ||E[Z7 '^i ]\\2 
By applying Theorem D.2, we obtain that 

^ T). 


|E[ZiZ^j|l2 =-, 

' 77117722 


i=l 


> 


Thus, 




with probability at least 1 — C 2 M we have 

< c,a,/E3EE, 

'' 777i7?72 


n 


where M = max(mi, 7712 ). It immediately implies that 

/ r log M 


which completes the proof. 


“ E 

71 „ 

i=i 2 


TOiTO 2 ? 7 ’ 


□ 


References 

Cai, T. T. and Zhou, W. (2013). Matrix completion via max-norm constrained optimization. arXiv preprint 
arXiv:1303.0341 . 


Candes, E. J. and Recht, B. (2012). Exact matrix completion via convex optimization. Commun. ACM 
55 111-119. 

Candes, E. J. and Tao, T. (2010). The power of convex relaxation: near-optimal matrix completion. IEEE 
Transactions on Information Theory 56 2053-2080. 

Fan, J. and Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties.
Journal of the American Statistical Association 96 1348-1360. 

Gong, P., Zhang, C., Lu, Z., Huang, J. and Ye, J. (2013). A general iterative shrinkage and thresholding 
algorithm for non-convex regularized optimization problems. In ICML. 

Gunasekar, S., Ravikumar, P. and Ghosh, J. (2014). Exponential family matrix completion under 
structural constraints. In ICML. 

Hu, Y., Zhang, D., Ye, J., Li, X. and He, X. (2013). Fast and accurate matrix completion via truncated 
nuclear norm regularization. IEEE Trans. Pattern Anal. Mach. Intell. 35 2117-2130.

Jaggi, M. and Sulovsky, M. (2010). A simple algorithm for nuclear norm regularized problems. In ICML. 

Jain, P., Meka, R. and Dhillon, I. S. (2010). Guaranteed rank minimization via singular value projection. 
In NIPS. 

Jain, P., Netrapalli, P. and Sanghavi, S. (2013). Low-rank matrix completion using alternating 
minimization. In Symposium on Theory of Computing Conference. 

Ji, S. and Ye, J. (2009). An accelerated gradient method for trace norm minimization. In ICML. 

Koltchinskii, V., Lounici, K., Tsybakov, A. B. et al. (2011a). Nuclear-norm penalization and optimal
rates for noisy low-rank matrix completion. The Annals of Statistics 39 2302-2329. 

Koltchinskii, V. et al. (2011b). Von Neumann entropy penalization and low-rank matrix estimation. The
Annals of Statistics 39 2936-2973. 

Liu, D., Zhou, T., Qian, H., Xu, C. and Zhang, Z. (2013). A nearly unbiased matrix completion approach. 
In ECML. 

Loh, P.-L. and Wainwright, M. J. (2013). Regularized m-estimators with nonconvexity: Statistical and 
algorithmic theory for local optima. In NIPS. 

Lu, C., Tang, J., Yan, S. and Lin, Z. (2014). Generalized nonconvex nonsmooth low-rank minimization. 
In 2014 IEEE Conference on CVPR 2014, Columbus, OH, USA, June 23-28, 2014. 

Mazumder, R., Hastie, T. and Tibshirani, R. (2010). Spectral regularization algorithms for learning 
large incomplete matrices. The Journal of Machine Learning Research 11 2287-2322. 

Negahban, S. and Wainwright, M. J. (2011). Estimation of (near) low-rank matrices with noise and 
high-dimensional scaling. The Annals of Statistics 39 1069-1097. 

Negahban, S. and Wainwright, M. J. (2012). Restricted strong convexity and weighted matrix completion: 
Optimal bounds with noise. Journal of Machine Learning Research 13 1665-1697. 

Negahban, S. N., Ravikumar, P., Wainwright, M. J. and Yu, B. (2012). A unified framework for 
high-dimensional analysis of M-estimators with decomposable regularizers. Statistical Science 27 538-557. 


Nie, F., Wang, H., Cai, X., Huang, H. and Ding, C. H. Q. (2012). Robust matrix completion via joint 
Schatten p-norm and lp-norm minimization. In ICDM. 

Recht, B., Fazel, M. and Parrilo, P. A. (2010). Guaranteed minimum-rank solutions of linear matrix 
equations via nuclear norm minimization. SIAM Review 52 471-501. 

Rohde, A., Tsybakov, A. B. et al. (2011). Estimation of high-dimensional low-rank matrices. The 
Annals of Statistics 39 887-930. 

Shalev-Shwartz, S., Gonen, A. and Shamir, O. (2011). Large-scale convex minimization with a low-rank 
constraint. In ICML. 

Srebro, N., Rennie, J. D. M. and Jaakkola, T. (2004). Maximum-margin matrix factorization. In 
NIPS. 

Srebro, N. and Shraibman, A. (2005). Rank, trace-norm and max-norm. In Proceedings of the 18th 
Annual Conference on Learning Theory. Springer-Verlag. 

Vershynin, R. (2010). Introduction to the non-asymptotic analysis of random matrices. arXiv preprint 
arXiv:1011.3027. 

Wang, S., Liu, D. and Zhang, Z. (2013a). Nonconvex relaxation approaches to robust matrix recovery. In 
Proceedings of the 23rd International Joint Conference on Artificial Intelligence. 

Wang, Z., Lai, M., Lu, Z., Fan, W., Davulcu, H. and Ye, J. (2014). Rank-one matrix pursuit for matrix 
completion. In ICML. 

Wang, Z., Liu, H. and Zhang, T. (2013b). Optimal computational and statistical rates of convergence for 
sparse nonconvex learning problems. arXiv preprint arXiv:1306.4960. 

Weyl, H. (1912). Das asymptotische Verteilungsgesetz der Eigenwerte linearer partieller Differentialgleichungen. 
Mathematische Annalen 71 441-479. 

Zhang, C.-H. (2010). Nearly unbiased variable selection under minimax concave penalty. The Annals of 
Statistics 894-942. 

Zhang, C.-H., Zhang, T. et al. (2012). A general theory of concave regularization for high-dimensional 
sparse estimation problems. Statistical Science 27 576-593. 

Zou, H. (2006). The adaptive lasso and its oracle properties. Journal of the American Statistical 
Association 101 1418-1429. 

